#### Imputing Values

You now have some experience working with missing values and imputing them with common methods. Now it is your turn to put those skills to work so that you can make predictions for rows even when they contain NaN values.

First, let's import the necessary libraries and reproduce the results you achieved in the previous attempt.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
%matplotlib inline

In [2]:
df = pd.read_csv('./survey_results_public.csv')

In [3]:
# Only use quant variables and drop any rows with missing values
num_vars = df[['Salary', 'CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]
df_dropna = num_vars.dropna(axis = 0)

In [4]:
# Split into explanatory and response variables
X = df_dropna[['CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]
y = df_dropna['Salary']

In [5]:
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 42)

In [6]:
# Note: the normalize argument was removed in scikit-learn 1.2; omit it on newer versions
lm_model = LinearRegression(normalize = True) # Instantiate
lm_model.fit(X_train, y_train) # Fit

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)

In [7]:
# predict and score the model
y_test_preds = lm_model.predict(X_test)

"The r-squared score for your model was {} on {} values.".format(r2_score(y_test, y_test_preds), len(y_test))

'The r-squared score for your model was 0.019170661803761813 on 645 values.'

#### Question 1

**1.** As you may remember from an earlier analysis, there are many more salaries to predict than the number of values scored by the code above. One way to make predictions for more of these rows is to impute the missing values in the **X** matrix instead of dropping the rows.

Using the **num_vars** dataframe, drop the rows with a missing response (Salary) and store the new dataframe in **drop_sal_df**. Then impute all the other missing values with the mean of their column and store the result in **fill_df**.

In [8]:
drop_sal_df = num_vars.dropna(axis = 0, how = 'any', subset = ['Salary'])

In [9]:
fill_df = drop_sal_df.fillna(drop_sal_df.mean())

# Equivalent alternative: apply a mean-fill function column by column
# fill_mean = lambda col: col.fillna(col.mean()) # Mean function
# fill_df = drop_sal_df.apply(fill_mean, axis = 0)
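
To see what these two steps do row by row, here is a minimal sketch on a tiny, made-up dataframe. The `toy` dataframe and its values are invented for illustration and are not taken from the survey data; only the `dropna` and `fillna` calls mirror the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for num_vars (values invented for illustration)
toy = pd.DataFrame({
    'Salary': [50000.0, np.nan, 70000.0, 90000.0],
    'HoursPerWeek': [40.0, 35.0, np.nan, 50.0],
    'JobSatisfaction': [7.0, np.nan, 9.0, np.nan],
})

# Drop only the rows where the response (Salary) is missing
toy_drop_sal = toy.dropna(subset=['Salary'], axis=0)

# Fill the remaining gaps with each column's mean,
# computed from the rows that survived the drop
toy_fill = toy_drop_sal.fillna(toy_drop_sal.mean())

print(toy_fill)
```

Note that the row with the missing Salary is removed before the means are computed, so its HoursPerWeek value of 35 does not contribute to the imputed value.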

#### Question 2

**2.** Using **fill_df**, predict Salary based on all of the other quantitative variables in the dataset. You can use the template above to assist fitting your model:

    1. Split the data into explanatory and response variables
    2. Split the data into train and test (using seed of 42 and test_size of .30 as above)
    3. Instantiate your linear model using normalized data
    4. Fit your model on the training data
    5. Predict using the test data
    6. Compute a score for your model fit on all the data, and show how many rows you predicted for
    
Use the tests to ensure you completed the steps correctly.

In [10]:
# Split into explanatory and response variables
X = fill_df[['CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]
y = fill_df['Salary']

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 42)

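# Instantiate and fit the model
# (the normalize argument was removed in scikit-learn 1.2; omit it on newer versions)
lm_model = LinearRegression(normalize = True)
lm_model.fit(X_train, y_train)

# Predict and score the model
y_test_preds = lm_model.predict(X_test)

rsquared_score = r2_score(y_test, y_test_preds)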
length_y_test = len(y_test)

"The r-squared score for your model was {} on {} values.".format(rsquared_score, length_y_test)

'The r-squared score for your model was 0.03257139063404435 on 1503 values.'
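
The test set grows from 645 to 1503 rows because imputation keeps rows that `dropna` discarded. The sketch below wraps the split, fit, predict, and score steps in a hypothetical helper, `fit_and_score`, and runs it on synthetic data so the two approaches can be compared side by side. The helper name, the synthetic columns, and the 30% missingness rate are all assumptions made for illustration. The `normalize` argument is left out because it no longer exists in scikit-learn 1.2 and later; for ordinary least squares with an intercept, rescaling the features does not change the predictions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score


def fit_and_score(data, response_col, test_size=0.30, random_state=42):
    """Hypothetical helper: split, fit a linear model, return (r2, n_test_rows)."""
    X = data.drop(columns=[response_col])
    y = data[response_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    model = LinearRegression()
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return r2_score(y_test, preds), len(y_test)


# Synthetic data (invented for illustration, not the survey data)
rng = np.random.default_rng(0)
n = 200
synth = pd.DataFrame({
    'x1': rng.normal(size=n),
    'x2': rng.normal(size=n),
})
synth['target'] = 3 * synth['x1'] - 2 * synth['x2'] + rng.normal(size=n)

# Blank out roughly 30% of x1 at random to mimic unanswered survey questions
mask = rng.random(n) < 0.3
synth.loc[mask, 'x1'] = np.nan

dropped = synth.dropna(axis=0)        # discard every incomplete row
imputed = synth.fillna(synth.mean())  # keep all rows, fill gaps with column means

r2_drop, n_drop = fit_and_score(dropped, 'target')
r2_fill, n_fill = fit_and_score(imputed, 'target')

print("dropna: r2 = {:.3f} on {} test rows".format(r2_drop, n_drop))
print("impute: r2 = {:.3f} on {} test rows".format(r2_fill, n_fill))
```

The imputed version always scores more test rows, but the r-squared values from the two runs are computed on different test sets, so they are not directly comparable.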