#### Removing Data Part II

So, you have now seen how we can fit a model by dropping rows with missing values. This is great in that sklearn doesn't break! However, it also means that any future observation with a missing value in one of the columns will not receive a prediction.
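
To make that limitation concrete, here is a minimal sketch on made-up data. The column names `a` and `b`, the values, and the variable names are all hypothetical and not taken from the survey. It assumes a scikit-learn version in which `LinearRegression` rejects NaN inputs with a `ValueError`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data (not the survey): the target is exactly 2*a + 3*b
train = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': [1.0, 0.0, 2.0, 1.0]})
target = 2 * train['a'] + 3 * train['b']

toy_model = LinearRegression().fit(train, target)

# A new observation with every feature present gets a prediction
complete = pd.DataFrame({'a': [5.0], 'b': [1.0]})
complete_pred = toy_model.predict(complete)

# A new observation with a missing feature is rejected
incomplete = pd.DataFrame({'a': [5.0], 'b': [np.nan]})
try:
    toy_model.predict(incomplete)
    nan_rejected = False
except ValueError:
    nan_rejected = True
```

A model fit only on complete rows has no way to score the second observation unless the missing value is filled in or the row is dropped first.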

In this notebook, you will answer a few questions about what happened in the last screencast, and take a few additional steps.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
%matplotlib inline

In [2]:
df = pd.read_csv('./survey_results_public.csv')

num_vars = df[['Salary', 'CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]

num_vars.head()

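| | Salary | CareerSatisfaction | HoursPerWeek | JobSatisfaction | StackOverflowSatisfaction |
|---|---|---|---|---|---|
| 0 | NaN | NaN | 0.0 | NaN | 9.0 |
| 1 | NaN | NaN | NaN | NaN | 8.0 |
| 2 | 113750.0 | 8.0 | NaN | 9.0 | 8.0 |
| 3 | NaN | 6.0 | 5.0 | 3.0 | 10.0 |
| 4 | NaN | 6.0 | NaN | 8.0 | NaN |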


#### Question 1
**1.** What proportion of individuals in the dataset reported a salary?

In [3]:
num_vars['Salary'].notnull().mean()

0.2622238509056643
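
The one-liner works because `notnull()` returns booleans and the mean of a boolean series is the fraction of `True` values. The sketch below shows this on a hypothetical five-value series (the name `toy_salary` and its values are made up), then recomputes the survey proportion from the counts that `df.count()` reports later in this notebook (5009 non-null salaries out of 19102 respondents).

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with 2 of 5 values present
toy_salary = pd.Series([np.nan, 50000.0, np.nan, 72000.0, np.nan])

# Mean of the boolean mask = share of non-missing values
prop_mean = toy_salary.notnull().mean()

# Equivalent: non-null count divided by total length
prop_count = toy_salary.count() / len(toy_salary)

# Same ratio for the survey, using the counts shown by df.count()
survey_prop = 5009 / 19102
```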

#### Question 2

**2.** Remove the rows with NaN values in Salary (only Salary) from the dataframe **num_vars**. Store the dataframe with these rows removed in **sal_rm**.

In [4]:
sal_rm = num_vars.dropna(axis = 0, how = 'any', subset = ['Salary'])

#### Question 3

**3.** Using **sal_rm**, create **X**, a dataframe (matrix) of all of the numeric feature variables. Then let **y** be the response vector you would like to predict (Salary). Run the cell below once you have split the data, and use the result of the code to assign the correct letter to **question3_solution**.

In [5]:
X = sal_rm[['CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]
y = sal_rm['Salary']

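# Split data into training and test data, and fit a linear model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state = 42)
lm_model = LinearRegression()  # Instantiate the model before fitting

# X_train still contains NaN values, so the fit is expected to fail
try:
    lm_model.fit(X_train, y_train)
except ValueError:
    print("Oh no!, It doesn't work")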

Oh no!, It doesn't work
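
Dropping rows on `Salary` alone leaves missing values in the feature columns, and scikit-learn's `LinearRegression` does not accept those. The sketch below rebuilds the five rows shown by `num_vars.head()` above and compares the two ways of dropping. The variable names (`head_rows`, `salary_only`, and so on) are illustrative only.

```python
import numpy as np
import pandas as pd

# The five rows shown by num_vars.head() above
head_rows = pd.DataFrame({
    'Salary': [np.nan, np.nan, 113750.0, np.nan, np.nan],
    'CareerSatisfaction': [np.nan, np.nan, 8.0, 6.0, 6.0],
    'HoursPerWeek': [0.0, np.nan, np.nan, 5.0, np.nan],
    'JobSatisfaction': [np.nan, np.nan, 9.0, 3.0, 8.0],
    'StackOverflowSatisfaction': [9.0, 8.0, 8.0, 10.0, np.nan],
})

# Drop only the rows where Salary is missing
salary_only = head_rows.dropna(subset=['Salary'])

# Count the NaNs that remain in the feature columns
remaining_nans = salary_only.drop(columns='Salary').isnull().sum()

# Drop the rows with a missing value in any column
any_column = head_rows.dropna(how='any')
```

In this small sample, the one row that keeps its salary still has a missing `HoursPerWeek`, and no row is complete in every column.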


#### Question 4

**4.** Remove the rows with NaN values in any column from **num_vars** (this was the removal process used in the screencast). Store the dataframe with these rows removed in **all_rm**.

In [6]:
all_rm = num_vars.dropna(axis = 0, how = 'any')

In [7]:
all_rm.count()

Salary                       2147
CareerSatisfaction           2147
HoursPerWeek                 2147
JobSatisfaction              2147
StackOverflowSatisfaction    2147
dtype: int64

#### Question 5

**5.** Using **all_rm**, create **X_2**, a dataframe (matrix) of all of the numeric feature variables. Then let **y_2** be the response vector you would like to predict (Salary). Run the cell below once you have split the data, and use the result of the code to assign the correct letter to **question5_solution**.

In [8]:
X_2 = all_rm[['CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]
y_2 = all_rm['Salary']

# Split data into training and test data, and fit a linear model
X_2_train, X_2_test, y_2_train, y_2_test = train_test_split(X_2, y_2, test_size = .30, random_state = 42)
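# The `normalize` argument was removed in scikit-learn 1.2; ordinary least squares gives the same predictions without it
lm_2_model = LinearRegression()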

# If our model works, it should just fit our model to the data. Otherwise, it will let us know.
try:
    lm_2_model.fit(X_2_train, y_2_train)
except ValueError:
    print("Oh no!, It doesn't work!!!")

#### Question 6

**6.** Now, use **lm_2_model** to predict the response values for **X_2_test**, and obtain an r-squared value for how well the predicted values compare to the actual **y_2_test** values.

In [9]:
y_test_preds = lm_2_model.predict(X_2_test)

In [10]:
r2_test = r2_score(y_2_test, y_test_preds)
r2_test

0.019170661803761813
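
An r-squared of about 0.019 suggests that these four features explain only around 2% of the variance in the test salaries. As a reminder of what `r2_score` computes, the sketch below reproduces it by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The arrays are hypothetical values, not survey results.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values (not from the survey)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Residual sum of squares: error left over after the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)
```

A value near 0 means the model does little better than predicting the mean salary for everyone.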

In [11]:
all_rm.count()

Salary                       2147
CareerSatisfaction           2147
HoursPerWeek                 2147
JobSatisfaction              2147
StackOverflowSatisfaction    2147
dtype: int64

In [12]:
df.count()

Respondent              19102
Professional            19102
ProgramHobby            19102
Country                 19102
University              19102
                        ...  
QuestionsInteresting    12736
QuestionsConfusing      12706
InterestedAnswers       12760
Salary                   5009
ExpectedSalary            818
Length: 154, dtype: int64

In [13]:
len(y_test_preds)

645
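
The 645 predictions correspond to the 30% test split of the 2147 complete rows. The sketch below checks that arithmetic, on the assumption that `train_test_split` rounds a fractional test size up to the next whole row. The placeholder array stands in for the real data and only its length matters.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 2147  # rows in all_rm, per all_rm.count()

# 0.30 * 2147 = 644.1, which rounds up to 645 test rows
expected_test = math.ceil(0.30 * n_rows)

# Placeholder array with the same number of rows as all_rm
placeholder = np.arange(n_rows)
train_part, test_part = train_test_split(placeholder, test_size=.30, random_state=42)
```

The remaining 1502 rows are what the model was actually trained on, out of 19102 survey respondents.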

In [14]:
lm_2_model.coef_

array([ 3660.26011379,  -570.69046347,  -973.99215225, -3205.885369  ])
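
The coefficients come back in the same order as the columns of **X_2**, so pairing them with the column names makes them easier to read. The sketch below does this with the values printed above. The name `coef_table` is illustrative only.

```python
import numpy as np
import pandas as pd

# Feature columns in the order used to build X_2
feature_names = ['CareerSatisfaction', 'HoursPerWeek',
                 'JobSatisfaction', 'StackOverflowSatisfaction']

# Coefficients copied from the lm_2_model.coef_ output above
coefs = np.array([3660.26011379, -570.69046347, -973.99215225, -3205.885369])

# Label each coefficient with its feature name
coef_table = pd.Series(coefs, index=feature_names)
```

Given the very low r-squared, these values are best read with caution rather than as reliable effects on salary.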