## Lab 2 - Linear regression for missing values

Group members:
- Name (ID): Sai Sahas Elluru (0753808)
- Name (ID): Hari Sai Palem (0747511)
- Name (ID): Siddharth Singh (0756590)

### Academic integrity statement

*Replace the underscores below with your names, acknowledging that you have read and understood the statement in the context of St. Clair College’s Academic Integrity policies.*

We, Sai Sahas Elluru, Hari Sai Palem & Siddharth Singh, hereby state that in preparing this lab for submission for grading, we have abided by the College’s academic integrity policies, and that all work presented is our own.

### Overview

In this lab, the main objective is to use a linear regression model to make predictions that you can use to fill in missing values in a dataset. The procedure is the same as usual; however, you use one of the features as the "target" instead of what you may normally think of as the target for that particular dataset. By the end of this lab you should have:

- gained experience manipulating dataframes with Pandas
- an initial understanding of how missing data is represented
- applied a linear regression model to fill in missing data   

### Grading

This lab will be graded as follows:
- 50% for comments/text
    - Half of the lab grade will come from an assessment of the comments/text included in your Jupyter notebook submission
        - The comments/text should explain clearly what you are doing and why it's necessary to achieve the objective
        - You should think of the comments/text as if you were creating a tutorial/blog to guide someone through your work 
- 50% for code
    - Half of the lab grade will come from an assessment of your code
        - The code in the notebook should use base python, NumPy, Pandas, sklearn, and/or matplotlib. 
        - All code cells should run error free
        - The code does not have to be optimized or pretty: it needs to be functional for the specific task

### What to submit

You should submit the following:
- a well-commented Jupyter notebook
- the original dataset used as a .csv file
    - if it did not come as a .csv file, you can write to a .csv from Pandas using `.to_csv()`

### Instructions

Please execute the following steps using a mixture of base python, NumPy, sklearn, Pandas, and/or matplotlib:

1. Find a dataset
    - I would suggest looking [here](https://archive.ics.uci.edu/ml/datasets.php) for **regression** datasets
    - The dataset for this lab does not have to be complicated, but it should meet the following criteria:
        - have at least 100 samples/rows
        - have at least 4 numeric features
    - if necessary, categorical features can simply be dropped from the dataset
2. Import the data as a Pandas dataframe
    - Depending on the data format, you may need to consult this [page](https://pandas.pydata.org/pandas-docs/stable/reference/io.html)
3. Verify that your data has no missing values
    - If it does have missing values, drop them from the dataset but be sure that your dataset still meets the criteria of *Step 1* above
4. Choose a single, numeric feature (not the target)
    - Replace approximately 15% of the values of this feature with `nan`, which means "not a number" and is one way to represent missing data
5. Split your dataset into 2 dataframes
    - Dataframe 1: has all `nan` values for the feature chosen in *Step 4*
    - Dataframe 2: has no `nan` values for the feature chosen in *Step 4*
6. Use *Dataframe 2* to create a linear regression model to predict the feature chosen in *Step 4* (not the usual target)
    - Split the data
    - Scale the data
    - Create the model
    - evaluate the model on the train and test sets
7. Use the model you created in *Step 6* to predict the missing values in *Dataframe 1*
    - At the end of this step, *Dataframe 1* will have the `nan` values replaced with the predictions from the model you created in *Step 6*
8. Create a final dataframe by combining *Dataframe 1* and *Dataframe 2*
    - This dataframe should have no missing values
9. Create a k nearest neighbours regressor (`k = 3`) for the dataframe you created in *Step 8*
    - Follow the usual procedures
10. Create a k nearest neighbours regressor (`k = 3`) for the original dataframe (from *Step 2* and maybe *Step 3*)
    - Follow the usual procedures
11. Is there any significant performance difference between *Step 9* and *Step 10*?


### Visualizing the process

<img src="Lab_2_sequence.png" width=600 align="center">

### Standard package import

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor 


### What each import is used for

- NumPy: Provides multi-dimensional array and matrix data structures, along with a wide range of mathematical operations on them.

- Pandas: Used for data analysis and data manipulation.

- train_test_split: Splits the data into training and test sets.

- StandardScaler: Scales each feature by subtracting its mean and dividing by its standard deviation, so that every feature has zero mean and unit variance.

- LinearRegression: Fits a linear model that predicts a target variable from one or more features.

- mean_squared_error: The average of the squared differences between the predictions and the observed values.

- mean_absolute_error: The average of the absolute differences between the predictions and the observed values.

- r2_score: The proportion of the variance in the dependent variable that the model explains.

- KNeighborsRegressor: Regression based on k-nearest neighbors. The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set.

### Step 2

In [3]:
automobile = pd.read_csv('Automobile.csv')

print(automobile.describe())
print()
automobile.drop(automobile.loc[automobile['horsepower']=='?'].index, inplace=True)
automobile["horsepower"] = automobile["horsepower"].astype(str).astype(int)
automobile.drop(automobile.loc[automobile['peak-rpm']=='?'].index, inplace=True)
automobile["peak-rpm"] = automobile["peak-rpm"].astype(str).astype(int)
print(automobile.dtypes)
print()
auto = automobile[['symboling', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-size', 'compression-ratio', 'horsepower', 'city-mpg', 'highway-mpg', 'price']]
print(auto.head())


        symboling  wheel-base      length       width      height  \
count  201.000000  201.000000  201.000000  201.000000  201.000000   
mean     0.840796   98.797015  174.200995   65.889055   53.766667   
std      1.254802    6.066366   12.322175    2.101471    2.447822   
min     -2.000000   86.600000  141.100000   60.300000   47.800000   
25%      0.000000   94.500000  166.800000   64.100000   52.000000   
50%      1.000000   97.000000  173.200000   65.500000   54.100000   
75%      2.000000  102.400000  183.500000   66.600000   55.500000   
max      3.000000  120.900000  208.100000   72.000000   59.800000   

       curb-weight  engine-size  compression-ratio    city-mpg  highway-mpg  \
count   201.000000   201.000000         201.000000  201.000000   201.000000   
mean   2555.666667   126.875622          10.164279   25.179104    30.686567   
std     517.296727    41.546834           4.004965    6.423220     6.815150   
min    1488.000000    61.000000           7.000000   13.000000

- `pd.read_csv` reads the csv file 'Automobile.csv' into a dataframe.
- `describe()` gives summary statistics for the numeric variables.
- We removed the rows that contain the special character `?` in 'horsepower' and 'peak-rpm' using `drop()`.
- We then converted 'horsepower' and 'peak-rpm' to int32.
- Finally, we selected the numeric variables from the dataset for our linear regression model.
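
As an alternative to dropping the `?` rows by hand, pandas can treat `?` as missing while it reads the file. The sketch below is only an illustration: it assumes a small inline CSV (hypothetical values, not the real 'Automobile.csv') and uses the `na_values` parameter of `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical miniature CSV standing in for 'Automobile.csv'.
csv_text = """horsepower,peak-rpm,price
111,5000,13495
?,5500,16500
154,?,16500
102,5500,13950
"""

# Treat '?' as missing so the columns are parsed as numbers, not strings.
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Drop rows where either column is missing, then cast back to integers.
toy = toy.dropna(subset=["horsepower", "peak-rpm"])
toy = toy.astype({"horsepower": int, "peak-rpm": int})

print(toy)
print(toy.dtypes)
```

With this approach the cleaning is two lines, and `describe()` would already include 'horsepower' and 'peak-rpm' because they are numeric from the start.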

### Step 3

In [64]:
auto.isnull().sum()

symboling            0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-size          0
compression-ratio    0
horsepower           0
city-mpg             0
highway-mpg          0
price                0
dtype: int64

- `isnull().sum()` counts the null values in each column. Every count is 0, so the dataset has no missing values.

### Step 4

In [37]:
split = np.random.rand(len(auto)) < 0.85
df2 = auto[split]
print(df2)

     symboling  wheel-base  length  width  height  curb-weight  engine-size  \
1            3        88.6   168.8   64.1    48.8         2548          130   
2            1        94.5   171.2   65.5    52.4         2823          152   
3            2        99.8   176.6   66.2    54.3         2337          109   
4            2        99.4   176.6   66.4    54.3         2824          136   
5            2        99.8   177.3   66.3    53.1         2507          136   
..         ...         ...     ...    ...     ...          ...          ...   
196         -1       109.1   188.8   68.9    55.5         2952          141   
197         -1       109.1   188.8   68.8    55.5         3049          141   
198         -1       109.1   188.8   68.9    55.5         3012          173   
199         -1       109.1   188.8   68.9    55.5         3217          145   
200         -1       109.1   188.8   68.9    55.5         3062          141   

     compression-ratio  horsepower  city-mpg  highw

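- In the above cell we build a random Boolean mask that is `True` for roughly 85% of the rows, which prepares the split the lab asks for.
- `df2` contains the rows where the mask is `True`, so it holds approximately 85% of the data, selected at random.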
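
The lab describes this step as replacing about 15% of one feature with `nan`. A minimal sketch of that literal version is shown here, assuming a synthetic stand-in dataframe and a fixed seed (both are assumptions for illustration, not part of our run above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (an assumption) so the mask is reproducible

# Hypothetical stand-in for the `auto` dataframe.
toy = pd.DataFrame({
    "engine-size": rng.integers(60, 300, size=200),
    "horsepower": rng.integers(50, 250, size=200).astype(float),
})

# True for roughly 15% of the rows: these values will be treated as missing.
missing = rng.random(len(toy)) < 0.15

# Work on a copy so the original values stay available for later comparison.
masked = toy.copy()
masked.loc[missing, "horsepower"] = np.nan

print(f"fraction missing: {masked['horsepower'].isna().mean():.3f}")
```

Because the mask is random, the missing fraction is only approximately 15%; fixing the seed makes the same rows missing on every run.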

### Step 5

In [51]:
df1 = auto[~split].copy()
df1['horsepower'] = np.nan
print(df1)
df1_1 = df1.drop(columns = 'horsepower')

     symboling  wheel-base  length  width  height  curb-weight  engine-size  \
0            3        88.6   168.8   64.1    48.8         2548          130   
6            1       105.8   192.7   71.4    55.7         2844          136   
7            1       105.8   192.7   71.4    55.7         2954          136   
18           1        94.5   155.9   63.6    52.0         1874           90   
23           1        93.7   157.3   63.8    50.6         1967           90   
29           2        86.6   144.6   63.9    50.8         1713           92   
31           1        93.7   150.0   64.0    52.6         1837           79   
33           1        93.7   150.0   64.0    52.6         1956           92   
38           0        96.5   175.4   65.2    54.1         2304          110   
45           0       113.0   199.6   69.6    52.8         4066          258   
51           1        93.1   166.8   64.2    54.1         1950           91   
57           0        98.8   177.8   66.5    55.5   


- We select 'horsepower' as our new target variable and use `np.nan` to set all of its values in `df1` to `nan`. `df1` contains roughly 15% of the original dataset.
- We also create a dataframe called `df1_1`, which is `df1` without 'horsepower'. It is used in Step 7 to concatenate the predictions with the remaining columns.
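
If the `nan` values are already in the column, the two dataframes can also be obtained from `isna()`. The sketch below assumes a small hypothetical dataframe; the names `df_missing` and `df_complete` are ours, for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with some missing values in 'horsepower'.
toy = pd.DataFrame({
    "engine-size": [130, 152, 109, 136, 90, 92],
    "horsepower": [111.0, np.nan, 102.0, 115.0, np.nan, 68.0],
})

is_missing = toy["horsepower"].isna()

# .copy() gives independent dataframes, so later assignments
# do not trigger pandas' SettingWithCopyWarning.
df_missing = toy[is_missing].copy()     # rows to be predicted (Dataframe 1)
df_complete = toy[~is_missing].copy()   # rows used for training (Dataframe 2)

print(df_missing.index.tolist(), df_complete.index.tolist())
```

Both pieces keep their original row labels, which makes it possible to put the predictions back in the right place later.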

### Step 6

In [59]:
X = df2.drop(columns = ['horsepower', 'price'])
y = df2.iloc[:, 8]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=808)
std_sc = StandardScaler()
std_sc.fit(X_train)
X_train_sc = std_sc.transform(X_train)
X_test_sc = std_sc.transform(X_test)
lr = LinearRegression()
print(lr.fit(X_train_sc, y_train))
y_pred = lr.predict(X_test_sc)
print()
print(lr.intercept_)
print()
print(lr.coef_)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

104.94573643410853

[ -1.27266487 -13.09190671   0.81135921   5.05306077  -0.59267695
   8.04851748  16.68328075  -1.99127361 -24.96005427   9.05849104]


- X holds the 10 remaining features and y is 'horsepower', which is our new target variable. We also removed our usual target variable, 'price', from X.
- We use `train_test_split` to split the data, fit `StandardScaler` on the training set only, transform both sets with it, and use the scaled data in our linear regression model.
- The data is split into four parts: X_train, y_train, X_test and y_test. We fit the model on X_train and y_train, predict on X_test, and compare those predictions with y_test.
- We print the intercept and the coefficients of the independent variables.
- Because the features are standardized, the intercept is the predicted 'horsepower' when every feature is at its training mean, which equals the mean of y_train.
- Each coefficient is the change in predicted 'horsepower' for a one-standard-deviation increase in that feature, with the other features held fixed. These are regression coefficients, not correlation coefficients.
- This model is fitted on df2 (the dataframe with no `nan` values).
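
The statement about the intercept can be checked on a small example. The sketch below uses synthetic data (an assumption, not the Automobile dataset) and shows that, after standardizing the features, the fitted intercept matches the mean of the training target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic data (an assumption): 3 features on different scales, linear signal plus noise.
X = rng.normal(loc=[10.0, 50.0, 200.0], scale=[2.0, 5.0, 30.0], size=(100, 3))
y = 4.0 * X[:, 0] - 1.5 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(scale=3.0, size=100)

X_sc = StandardScaler().fit_transform(X)
lr = LinearRegression().fit(X_sc, y)

# With centered features the intercept equals the mean of the training target.
print(lr.intercept_, y.mean())

# Each coefficient is the change in y per one standard deviation of that feature.
print(lr.coef_)
```

This is also why the coefficients of a model fitted on scaled data can be compared with each other: they are all expressed per standard deviation of their feature.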

In [70]:
absolute_error = mean_absolute_error(y_test, y_pred)
print(f'MAE {absolute_error}')
mse = mean_squared_error(y_test, y_pred)
print(f'MSE {mse}')
rmse = np.sqrt(mse)
print(f'RMSE {rmse}')
R2_test = r2_score(y_test, y_pred)
print(f'R Squared {R2_test}')

MAE 11.323845656469924
MSE 210.04529693322982
RMSE 14.49293955459795
R Squared 0.8377132727219736


- Mean absolute error (MAE) is the average magnitude of the prediction errors.
- Mean squared error (MSE) is the average of the squared prediction errors, so large errors are penalized more heavily than small ones.
- Root mean squared error (RMSE) is the square root of MSE, which puts it back in the units of the target. The smaller the value, the closer the predictions are to the observed values. We got 14.49 for our model.
- R² is the proportion of the variance in the target that the model explains. The higher the value, the better the fit. We got about 0.838 on the test set, which indicates a good fit.
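
To make these definitions concrete, the four metrics can be computed by hand with NumPy. The values below are hypothetical and chosen only so the arithmetic is easy to follow; the sklearn functions should return the same numbers.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical observed values and predictions.
y_true = np.array([100.0, 150.0, 80.0, 120.0])
y_hat = np.array([110.0, 140.0, 85.0, 115.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))                    # average absolute error
mse = np.mean(errors ** 2)                       # average squared error
rmse = np.sqrt(mse)                              # back in the units of the target
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # share of variance explained

print(mae, mse, rmse, r2)
```

Here the errors are -10, 10, -5 and 5, so MAE is 7.5 and MSE is 62.5.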

### Step 7

In [53]:
X_df1 = df1.drop(columns = ['horsepower', 'price'])
X_df1_sc = std_sc.transform(X_df1)
y_df1_pred = lr.predict(X_df1_sc)
y_df1 = pd.DataFrame(y_df1_pred)
y_df1.columns = ['horsepower']
print(y_df1)
print()
# Renumber the rows 0..n-1 so they line up with y_df1, whatever the size of df1
df1_1 = df1_1.reset_index(drop=True)
df_pred = pd.concat([df1_1, y_df1], axis = 1)
print(df_pred)

    horsepower
0   133.553394
1   128.539925
2   130.294681
3    45.601569
4    71.675750
5    30.087403
6    41.311829
7    70.513778
8    93.242740
9   185.724489
10   74.089814
11  100.460213
12  100.699498
13   73.447613
14  106.848352
15  110.079329
16   28.606773
17   70.155337
18   72.414731
19   70.681764
20  174.581773
21  118.191630
22   67.788930
23   81.064260
24   68.616182
25   90.159249

    symboling  wheel-base  length  width  height  curb-weight  engine-size  \
0           3        88.6   168.8   64.1    48.8         2548          130   
1           1       105.8   192.7   71.4    55.7         2844          136   
2           1       105.8   192.7   71.4    55.7         2954          136   
3           1        94.5   155.9   63.6    52.0         1874           90   
4           1        93.7   157.3   63.8    50.6         1967           90   
5           2        86.6   144.6   63.9    50.8         1713           92   
6           1        93.7   150.0   64.0    52.6

- For df1, we dropped 'horsepower' (our new target) and 'price' (our usual target) and named the result X_df1.
- We scaled X_df1 with the scaler fitted in Step 6 and used the model to predict the missing 'horsepower' values.
- The predicted values are stored in the dataframe y_df1, under the column name 'horsepower'.
- We reset the index of df1_1, because its rows were selected at random and still carried their original labels, and then concatenated df1_1 and y_df1 column-wise into df_pred.
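
The index handling in this step can be avoided by writing the predictions directly into the missing cells with `.loc`. This is a minimal sketch under stated assumptions: the dataframe, its values and the names `features`, `scaler`, `model` and `filled` are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 'horsepower' is missing in two rows.
toy = pd.DataFrame({
    "engine-size": [130, 152, 109, 136, 90, 92, 110, 258],
    "curb-weight": [2548, 2823, 2337, 2824, 1874, 1713, 2304, 4066],
    "horsepower": [111.0, 154.0, np.nan, 115.0, 70.0, np.nan, 84.0, 176.0],
})
features = ["engine-size", "curb-weight"]
missing = toy["horsepower"].isna()

# Fit the scaler and the model on the complete rows only.
scaler = StandardScaler().fit(toy.loc[~missing, features])
model = LinearRegression().fit(
    scaler.transform(toy.loc[~missing, features]),
    toy.loc[~missing, "horsepower"],
)

# Write the predictions straight into the missing cells; the index is untouched.
filled = toy.copy()
filled.loc[missing, "horsepower"] = model.predict(
    scaler.transform(toy.loc[missing, features])
)

print(filled)
```

With this variant the rows keep their original order and labels, so no separate combining step is needed afterwards.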

### Step 8 

In [54]:

# DataFrame.append was removed in pandas 2.0; pd.concat aligns the columns by name
# and ignore_index=True renumbers the rows 0..n-1
df_final = pd.concat([df2, df_pred], ignore_index=True)
print(df_final)
print(df_final.isnull().sum())

     symboling  wheel-base  length  width  height  curb-weight  engine-size  \
0            3        88.6   168.8   64.1    48.8         2548          130   
1            1        94.5   171.2   65.5    52.4         2823          152   
2            2        99.8   176.6   66.2    54.3         2337          109   
3            2        99.4   176.6   66.4    54.3         2824          136   
4            2        99.8   177.3   66.3    53.1         2507          136   
..         ...         ...     ...    ...     ...          ...          ...   
194          2        99.1   186.6   66.5    56.1         2758          121   
195          1        95.7   158.7   63.6    54.5         2015           92   
196          0        95.7   169.7   63.6    59.1         2290           92   
197         -1       102.4   175.6   66.5    54.9         2480          110   
198          2        97.3   171.7   65.5    55.7         2275          109   

     compression-ratio  horsepower  city-mpg  highw

- We combine df2 and df_pred row-wise, renumber the index, and name the result df_final.
- We check the final dataframe for null values again.
- df_final has the same rows and columns as the 'auto' dataframe from Step 2 and contains no `nan` values. The 'horsepower' values in the rows that came from df1 are model predictions.

### Step 9

In [55]:
A = df_final.drop(columns = 'price')
b = df_final.iloc[:, 8]
A_train, A_test, b_train, b_test = train_test_split(A, b, random_state=78)
std_sc.fit(A_train)
A_train_sc = std_sc.transform(A_train)
A_test_sc = std_sc.transform(A_test)
reg_std_sc = KNeighborsRegressor(n_neighbors=3)
reg_std_sc.fit(A_train_sc, b_train)
train_R2_sc = reg_std_sc.score(A_train_sc, b_train)
test_R2_sc = reg_std_sc.score(A_test_sc, b_test)
print(f'Scaled Training R^2 is {train_R2_sc} and Scaled Testing R^2 is {test_R2_sc} for k = 3')

Scaled Training R^2 is 0.9504567590428413 and Scaled Testing R^2 is 0.9262251243297761 for k = 3


- Here we fit a kNN regressor on df_final with `n_neighbors=3`. As before, we split the data, scale it with a scaler fitted on the training set, fit on A_train and b_train, and compute R² on both the training and test sets.
- A is df_final without 'price', our usual target. Note that `df_final.iloc[:, 8]` selects 'horsepower', the same column position used for y in Step 6, so b in this cell is 'horsepower' and that column is also still present in A. To use 'price' as the target, b would have to be `df_final['price']`.
- We report R², the coefficient of determination, which is the proportion of the variance in the target that the model explains. The best possible score is 1, a model that always predicts the mean of the target scores 0, and a model that does worse than that scores below 0.
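
A sketch of the same kNN procedure with the target selected by column name and the scaler wrapped in a pipeline is given below. It runs on synthetic data (an assumption, not df_final), so its scores are not comparable with the ones reported in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic stand-in for df_final (an assumption, not the Automobile data).
n = 200
toy = pd.DataFrame({
    "engine-size": rng.normal(130, 40, n),
    "horsepower": rng.normal(100, 35, n),
    "curb-weight": rng.normal(2500, 500, n),
})
toy["price"] = 60 * toy["engine-size"] + 50 * toy["horsepower"] + rng.normal(0, 1000, n)

# Select the target by name so a change in column order cannot swap it.
X = toy.drop(columns="price")
y = toy["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)

# The pipeline fits the scaler on the training data only, then the kNN model.
knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
knn.fit(X_train, y_train)

print(f"train R^2: {knn.score(X_train, y_train):.3f}")
print(f"test R^2:  {knn.score(X_test, y_test):.3f}")
```

Selecting columns by name instead of by position makes it harder to pick the wrong target by accident, and the pipeline removes the need to call `fit` and `transform` on the scaler by hand.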

### Step 10 

In [56]:
C = auto.drop(columns = 'price')
d = auto.iloc[:, 8]
C_train, C_test, d_train, d_test = train_test_split(C, d, random_state=78)
std_sc.fit(C_train)
C_train_sc = std_sc.transform(C_train)
C_test_sc = std_sc.transform(C_test)
reg_std_sc.fit(C_train_sc, d_train)
train_R2 = reg_std_sc.score(C_train_sc, d_train)
test_R2 = reg_std_sc.score(C_test_sc, d_test)
print(f'Scaled Training R^2 is {train_R2} and Scaled Testing R^2 is {test_R2} for k = 3')

Scaled Training R^2 is 0.9537333173579751 and Scaled Testing R^2 is 0.9100119579207889 for k = 3


- As in Step 9, we repeated the same procedure on the original dataset (`auto`) and calculated the R² values.

### Step 11

- Comparing the test R² values from Step 9 and Step 10, R² is slightly higher for df_final than for the original dataset.
- The original dataset 'auto' has a test R² of 0.910, while 'df_final', the dataset we built after predicting some values with the linear regression model, has a test R² of 0.926.
- We can conclude that the model created using df_final performs slightly better on this train/test split than the model created using the original dataset (auto), although a gap of about 0.016 from a single split is small.
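
One way to judge whether such a gap is meaningful is to compare the two datasets over several splits instead of one. The sketch below uses 5-fold cross-validation on synthetic data; the data, the way the "imputed" copy is perturbed and all names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Synthetic "original" data (an assumption): 200 rows, 4 features.
X_orig = rng.normal(size=(200, 4))
y = 3.0 * X_orig[:, 0] + 2.0 * X_orig[:, 1] + rng.normal(scale=0.5, size=200)

# "Imputed" copy: about 15% of feature 1 replaced by noisy estimates.
X_imp = X_orig.copy()
mask = rng.random(200) < 0.15
X_imp[mask, 1] += rng.normal(scale=0.3, size=mask.sum())

knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3))
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # same folds for both datasets

scores_orig = cross_val_score(knn, X_orig, y, cv=cv, scoring="r2")
scores_imp = cross_val_score(knn, X_imp, y, cv=cv, scoring="r2")

# Compare the gap between the means with the fold-to-fold spread.
print(f"original: {scores_orig.mean():.3f} +/- {scores_orig.std():.3f}")
print(f"imputed:  {scores_imp.mean():.3f} +/- {scores_imp.std():.3f}")
```

If the difference between the two means is smaller than the spread across folds, the two datasets cannot really be told apart by this model.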