### scikit-learn: machine learning in Python (https://scikit-learn.org/stable/)

## Installing

Choose one of the following commands:

`conda install scikit-learn`

`pip install -U scikit-learn`


In [None]:
import sklearn

In [None]:
sklearn.__version__

### Initial imports

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

### Loading dataset

In [None]:
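# From csv
# The raw cal_housing.data file has no header row, so header=None keeps
# the first record as data instead of using it as column names.
csv_data = pd.read_csv('cal_housing.data', header=None)
csv_data.head()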

In [None]:
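# From sklearn
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires scikit-learn < 1.2.
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)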

In [None]:
type(boston.data)

In [None]:
df = pd.DataFrame(boston.data)
df.head()

In [None]:
df.columns = boston.feature_names
df.head()

In [None]:
df['PRICE'] = boston.target
df.head()
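On scikit-learn 1.2 or later, where `load_boston` is no longer available, the same "load, then wrap in a DataFrame" steps can be tried on a dataset that still ships with the library. The sketch below uses `load_diabetes` purely as a stand-in (it is not the dataset used in the rest of this notebook), and the name `df_alt` is made up for the example.

```python
from sklearn.datasets import load_diabetes

# load_diabetes is bundled with scikit-learn, so nothing is downloaded.
# as_frame=True returns pandas objects instead of NumPy arrays.
diabetes = load_diabetes(as_frame=True)

# .frame holds the feature columns plus a "target" column in one DataFrame,
# which replaces the manual DataFrame / columns / target steps above.
df_alt = diabetes.frame
print(df_alt.shape)
print(df_alt.columns.tolist())
```

With `as_frame=True` the feature names are already attached, so the column assignment step is not needed.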

### Plotting data

In [None]:
# Checking pairwise relationships
sns.pairplot(df)

In [None]:
# Checking target variable distribution
# (sns.distplot is deprecated; histplot with kde=True is its replacement)
sns.histplot(df['PRICE'], kde=True)

In [None]:
# Checking correlation
print(df.corr())
sns.heatmap(df.corr())

In [None]:
plt.rc("figure", figsize = (12, 8))
sns.heatmap(df.corr(), annot = True)

### Splitting dataset

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('PRICE', axis = 1)
y = df['PRICE']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 100)
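To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on a tiny synthetic array. The names `X_demo`, `y_demo`, `X_tr`, `X_te` are illustrative only and are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten samples with two features each, plus a matching target vector.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.3 holds out 30% of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)
print(X_tr.shape, X_te.shape)

# Reusing the same random_state reproduces exactly the same split.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)
print(np.array_equal(X_te, X_te_again))
```

Fixing `random_state` is what makes results comparable between runs; leaving it unset gives a different shuffle each time.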

### Regression models

#### Linear regression

In [None]:
from sklearn.linear_model import LinearRegression
reg1 = LinearRegression()
reg1.fit(X_train, y_train)

In [None]:
print('Intercept: {}'.format(reg1.intercept_))
pd.DataFrame(reg1.coef_, X.columns, columns=['Coeff'])
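The intercept and coefficients fully describe a fitted linear model: a prediction is the intercept plus the dot product of the features with the coefficients. The sketch below checks this on synthetic data; `reg_demo`, `X_demo`, `y_demo` and the chosen true weights are assumptions made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 1.5 + 2*x0 - 1*x1 + 0.5*x2 + small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 1.5 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

reg_demo = LinearRegression().fit(X_demo, y_demo)

# Rebuild the predictions by hand from intercept_ and coef_.
manual = reg_demo.intercept_ + X_demo @ reg_demo.coef_

# The manual computation matches predict() up to floating-point error.
print(np.allclose(manual, reg_demo.predict(X_demo)))
```

Read this way, each coefficient is the change in the predicted value for a one-unit change in that feature, with the other features held fixed.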

In [None]:
y_pred = reg1.predict(X_test)
y_pred

In [None]:
figure, axes = plt.subplots()
axes.scatter(y_pred, y_test)
axes.set_xlabel("Predicted")
axes.set_ylabel("Real")
axes.set_xlim([0, 60])
axes.set_ylim([0, 60])

# sns.scatterplot(x=y_pred, y=y_test)

In [None]:
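# Distribution of the prediction errors
# (sns.distplot is deprecated; histplot with kde=True is its replacement)
sns.histplot(y_pred - y_test, bins = 40, kde = True)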
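The plots above give a visual check of the fit. As a numeric complement, `sklearn.metrics` provides standard regression scores. The following is a minimal sketch on synthetic data, not on the notebook's dataset; `X_demo`, `y_demo`, `model` and `pred` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y = 3*x + 2 plus Gaussian noise with unit variance.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + 2.0 + rng.normal(scale=1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=100
)
model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)

# Mean squared error, its square root (same units as y), and R^2.
mse = mean_squared_error(y_te, pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_te, pred)
print("MSE: {:.3f}  RMSE: {:.3f}  R2: {:.3f}".format(mse, rmse, r2))
```

The same three calls should work on `y_test` and `y_pred` from the cells above.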

## Assignment

Send the assignment to lmfh@cin.ufpe.br.

With the `tips` dataset:
1. Plot the distribution of the target variable (tips);
2. Plot a heat map of the correlations between the variables;
3. Fit a model other than LinearRegression (choose any other model from https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) to predict the tips variable;
4. Show a scatter plot of the predicted versus real values, and a distribution plot of the differences between the predicted and real values.