## Predicting the weight of a person using Linear Regression

#### Importing the packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from Questions import Questions
questions = Questions()

In [None]:
df = pd.read_csv('weight-height.csv')

#### Assessing the dataset

- Checking the first rows
- Looking for missing values
- Looking for duplicates
- Shape of the dataset

In [None]:
df.head()

After some research, and judging by the values in the Height and Weight columns, the units of measure appear to be inches and pounds, respectively.

In [None]:
df.shape

The dataset has 10000 rows and 3 columns.

In [None]:
df['Gender'].value_counts()

- The Gender variable is balanced.

In [None]:
df.info()

- There aren't missing values in our dataset
- The types of the variables are correct.

In [None]:
df.duplicated().any()

- There aren't duplicated rows in the dataset.

In [None]:
sns.pairplot(df, hue='Gender')

Seaborn is great for exploratory data analysis because, with a few lines of code, you get meaningful plots that make our lives easier.

- In this case for example, you can see that there is a linear relationship between the variables height and weight.

- Another thing that you can see more clearly by looking at the density plot is that the gender is correlated with the variables height and weight. In the weight variable, for example, the mode is, approximately, 155 pounds for the women and 200 pounds for the men in our sample.

So, gender and height are variables that are well suited for building a linear regression model to predict the weight of a person.

### Data pre-processing

- Transforming the dataset into two numpy arrays (X and y).
- Splitting the dataset into training and test set
- Transforming the variable Gender into a dummy / binary variable

#### Creating the X and y arrays

In machine learning, it is common practice to name matrices with uppercase letters and vectors with lowercase letters.

In [None]:
# To create the matrix X, drop the target variable (Weight)
# and convert the remaining columns to a NumPy array.
X = df.drop('Weight', axis=1).values

# .values is an attribute (not a method) that returns the data of a
# pandas Series or DataFrame as a NumPy array.
y = df['Weight'].values

print('Matrix X: ', X, 'Shape:', X.shape, sep='\n')
print()
print('Vector y: ', y, 'Shape:', y.shape, sep='\n')

#### Why should you split your dataset into a training and a test set?

Dataset splitting is very important because of the concept of generalization. A model is only useful if it performs well on data that it hasn't seen yet.

To understand this, imagine that you're studying for an exam. To get a good score, you have to learn how to solve the problems, not memorize the answers to the questions you practiced with. The same applies to machine learning models: we must make sure that our model is learning the relationships in our data, not memorizing it, because only then will it perform well on unseen data.

But how can we check how our model will perform on unseen data if we don't have that data yet?

People who have been working with data for a long time have come up with a solution for this.

Basically, you split your dataset into a training set and a test set. Your model is trained on the training set, and you use the parameters learned by the model to predict the values of the target variable, using the test set as input.

After that, you evaluate your model by comparing the predicted values with the observed values.
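
The idea above can be sketched with NumPy alone. This is only an illustration on synthetic data (the line `y = 3x + 5` plus noise is an assumption made up for the example, not the notebook's dataset): a straight line is fitted on the training rows, and the error is then measured on both the training rows and the held-out rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical, not the weight-height dataset):
# y = 3x + 5 plus Gaussian noise.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 5.0 + rng.normal(0, 1.0, size=200)

# Shuffle the row indices and hold out 25% of the rows as a test set.
indices = rng.permutation(len(x))
n_test = len(x) // 4
test_idx, train_idx = indices[:n_test], indices[n_test:]

# "Training": estimate slope and intercept from the training rows only.
slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


# Evaluate on rows the model has seen and on rows it has not seen.
train_rmse = rmse(y[train_idx], slope * x[train_idx] + intercept)
test_rmse = rmse(y[test_idx], slope * x[test_idx] + intercept)
print(f"train RMSE: {train_rmse:.3f}, test RMSE: {test_rmse:.3f}")
```

If the two errors are close, the model is generalizing; a test error much larger than the training error suggests memorization.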



#### Splitting the dataset

In [None]:
# random_state sets the seed used to shuffle the rows before splitting.
# Anyone who reruns this code with the same random_state, all else
# being equal, will get the same split and therefore the same results.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    random_state=1, 
                                                    test_size=0.25)

#### Categorical variables

Before creating a ML model, you must transform your categorical variables into numbers.

In the case of this dataset, we're dealing with a nominal categorical variable with two categories. So, we must transform the variable into a binary variable.

1. Pandas:
    - pd.get_dummies(drop_first=True)
2. Numpy Arrays:
    - LabelEncoder()
    - LabelBinarizer()
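
As a rough sketch of the pandas option, the snippet below applies `pd.get_dummies` to a small made-up DataFrame (the `toy` frame and its values are hypothetical, used only for illustration). `drop_first=True` keeps a single binary column for a two-category variable, and `dtype=int` is passed so the result is 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy data with the same kind of columns as the dataset.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Height": [70.0, 64.5, 62.0, 72.3],
})

# One dummy column per category, minus the first one (Female),
# so Gender becomes a single binary column named Gender_Male.
encoded = pd.get_dummies(toy, columns=["Gender"], drop_first=True, dtype=int)
print(encoded)
```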
    
One important point about data pre-processing is that you have to be careful about data leakage. Data leakage occurs when information from outside the training set is used to train the model.

Scikit-learn helps reduce this risk because most of its preprocessing objects provide two methods:
- fit_transform()
- transform()

The fit_transform method learns the parameters from X_train and, at the same time, transforms X_train into the desired format. The transform method reuses the parameters learned by fit_transform and transforms the test set into the same format as X_train.

To understand this, imagine that you have some missing values in a categorical variable and you want to use the scikit-learn object SimpleImputer (called Imputer in older versions) to fill those values with the mode of the variable. Scikit-learn will find the most frequent value of this variable in the training set and use it to transform the training set. Then, when you call the transform method on the test set, it will take the mode of the training set and replace the missing values of the test set with this value, even if the mode of the test set is different from that of the training set.
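
The imputation scenario can be sketched as follows, using `SimpleImputer` from `sklearn.impute`, which is the current scikit-learn class for this job. The two small arrays are made up for the example: the mode of the training set is "Male" while the mode of the test set is "Female".

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one categorical column with a missing value in each set.
X_train_toy = np.array([["Male"], ["Male"], ["Female"], [np.nan]], dtype=object)
X_test_toy = np.array([["Female"], ["Female"], [np.nan]], dtype=object)

imputer = SimpleImputer(strategy="most_frequent")

# fit_transform learns the mode from the training set and fills its gaps.
X_train_filled = imputer.fit_transform(X_train_toy)

# transform reuses the training mode; the test set's own mode is ignored.
X_test_filled = imputer.transform(X_test_toy)

print("Learned mode:", imputer.statistics_[0])
print("Imputed test value:", X_test_filled[2, 0])
```

Calling `fit_transform` on the test set instead would fill the gap with "Female", which is exactly the kind of leakage the fit/transform separation is meant to prevent.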


In [None]:
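from sklearn.preprocessing import LabelEncoder

# Learn the Gender categories from the training set only, then reuse
# the same mapping on the test set to avoid data leakage.
label = LabelEncoder()
X_train[:, 0] = label.fit_transform(X_train[:, 0])
X_test[:, 0] = label.transform(X_test[:, 0])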

In [None]:
X_train[:, 0]

#### Exercise One
- Fit a linear regression model, make predictions on the test set, and calculate the root mean squared error (RMSE).

In [None]:
# Type your code here


In [None]:
# Run this cell to check your results
questions.question_one()

#### What is the predicted weight of a woman who is 75 inches tall?

In [None]:
# Type your code here

In [None]:
# Run this cell to check your results
questions.question_two()

#### Using the default parameters, create a KNN model and evaluate it using RMSE.
- The documentation for creating a KNeighborsRegressor can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)

In [None]:
# Type your code here

#### What is the RMSE of the KNN model?

In [None]:
# Run this code 
questions.question_three()

As you may have seen in the documentation of KNeighborsRegressor, the default value for the number of neighbors is 5. What happens if you place the parameter n_neighbors=1 in the model?

##### What is the performance of the model on the training set?

- Hint: Replace X_test with X_train to get the model's predictions on the training set.

In [None]:
# Type your code here

In [None]:
# Run this code to check if your answer is correct
questions.question_four()

##### What is the performance of the model on the test set?


In [None]:
# Type your code here

In [None]:
questions.question_five()

In [None]:
questions.question_six()

#### Checking your overall results

In [None]:
# Run this cell
questions.print_results()