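### Removing Missing Values - Part II

You have now seen how to drop rows with missing values in order to fit a model. That helps, because sklearn no longer raises an error over missing values. It also means that we cannot make predictions for any row that contains a missing value.

In this notebook, we will answer a few questions from the previous video and take a few further steps.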

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
import RemovingData as t  # course helper module; provides the t.*_test / t.*_check calls used below
%matplotlib inline

df = pd.read_csv('../Data/survey_results_public.csv')

#Subset to only quantitative vars
num_vars = df[['Salary', 'CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]


num_vars.head()
num_vars.shape

(19102, 5)

#### Question 1

**1.** What proportion of individuals in the dataset reported a salary?

In [11]:
num_vars['Salary'].isnull()

0         True
1         True
2        False
3         True
4         True
         ...  
19097     True
19098     True
19099     True
19100    False
19101     True
Name: Salary, Length: 19102, dtype: bool

In [10]:
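# Method 1: count the rows left after dropping missing salaries, divided by all rows
prop_sals = num_vars.dropna(subset=['Salary']).shape[0] / num_vars.shape[0]  # Proportion of individuals in the dataset with salary reported
# Method 2: one minus the share of missing salaries
prop_sals1 = 1 - num_vars['Salary'].isnull().mean()
prop_sals1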

0.26222385090566436
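The two methods above should always agree, since both count the non-missing entries of one column. A minimal sketch on a toy DataFrame (the values are hypothetical and not taken from the survey) illustrates this, along with a third variant that uses `notnull()` to measure the reported share directly:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the survey): 2 of 5 salaries reported.
toy = pd.DataFrame({'Salary': [np.nan, 50000.0, np.nan, 72000.0, np.nan]})

# Method 1: rows that survive dropna, divided by the total row count.
prop_dropna = toy.dropna(subset=['Salary']).shape[0] / toy.shape[0]

# Method 2: isnull() gives booleans; their mean is the missing share.
prop_isnull = 1 - toy['Salary'].isnull().mean()

# Method 3: notnull() gives the reported share directly.
prop_notnull = toy['Salary'].notnull().mean()

print(prop_dropna, prop_isnull, prop_notnull)
```

All three values should come out as 2/5, up to floating-point rounding.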

In [55]:
t.prop_sals_test(prop_sals) #test

Nice job! That looks right!


#### Question 2

**2.** Drop every row of **num_vars** that has a missing value in the Salary column. Store the resulting data in **sal_rm**.

In [12]:
sal_rm = num_vars.dropna(subset=['Salary'])# dataframe with rows for nan Salaries removed

sal_rm.head()
# sal_rm.shape

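| | Salary | CareerSatisfaction | HoursPerWeek | JobSatisfaction | StackOverflowSatisfaction |
|---|---|---|---|---|---|
| 2 | 113750.0 | 8.0 | NaN | 9.0 | 8.0 |
| 14 | 100000.0 | 8.0 | NaN | 8.0 | 8.0 |
| 17 | 130000.0 | 9.0 | NaN | 8.0 | 8.0 |
| 18 | 82500.0 | 5.0 | NaN | 3.0 | NaN |
| 22 | 100764.0 | 8.0 | NaN | 9.0 | 8.0 |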
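Note that `dropna(subset=['Salary'])` only looks at the Salary column, so other columns can still hold missing values, and the original index labels are kept. A small sketch on hypothetical toy data (the frame `toy` and its values are made up for illustration) shows both effects:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values spread over several columns.
toy = pd.DataFrame({
    'Salary': [np.nan, 100000.0, 82500.0, np.nan],
    'HoursPerWeek': [40.0, np.nan, np.nan, 35.0],
    'JobSatisfaction': [7.0, 8.0, 3.0, np.nan],
})

# Drop rows only when Salary is missing.
toy_sal_rm = toy.dropna(subset=['Salary'])

# The surviving rows keep their original index labels.
kept_labels = list(toy_sal_rm.index)

# Count the missing values that remain in each column.
remaining_nans = toy_sal_rm.isnull().sum()

print(kept_labels)
print(remaining_nans)
```

Calling `reset_index(drop=True)` afterwards would renumber the rows from zero if the gaps in the index are unwanted.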


In [57]:
t.sal_rm_test(sal_rm) #test

Nice job! That looks right!


#### Question 3

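**3.** Using all the numeric variables in **sal_rm** other than Salary, create a DataFrame `X` (the feature matrix). Store the target variable to predict (Salary) in `y`. After splitting the data, run the code below and, based on the result, match the correct letter to the statement in **question3_solution**.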

In [58]:
X = sal_rm[['CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]#Create X using explanatory variables from sal_rm
y = sal_rm['Salary']#Create y using the response variable of Salary

# Split data into training and test data, and fit a linear model
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size=.30, random_state=42)

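# normalize=True centers each column by subtracting its mean and then divides it
# by its L2 norm (not its standard deviation) before fitting. The parameter was
# deprecated in scikit-learn 1.0 and removed in 1.2.
lm_model = LinearRegression(normalize=True)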

# If our model works, it should just fit our model to the data. Otherwise, it will let us know.
try:
    lm_model.fit(X_train, y_train)
except:
    print("Oh no! It doesn't work!!!")


Oh no! It doesn't work!!!


In [59]:
a = 'Python just likes to break sometimes for no reason at all.' 
b = 'It worked, because Python is magic.'
c = 'It broke because we still have missing values in X'

question3_solution = c

#test
t.question3_check(question3_solution)

Nice job! That's right! Those missing values in the X matrix will still not allow us to predict the response.
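The bare `except:` in the cell above hides the reason for the failure. As a rough sketch, catching `ValueError` specifically and printing its message makes the cause visible. The arrays `X_toy` and `y_toy` are hypothetical, and the exact wording of the message is assumed to vary between scikit-learn versions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy inputs; one feature value is missing.
X_toy = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 4.0], [4.0, 5.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0])

error_message = None
try:
    LinearRegression().fit(X_toy, y_toy)
except ValueError as err:
    # Catching a specific exception keeps the reason for the failure visible.
    error_message = str(err)

print(error_message)
```

The message should mention NaN in the input, which points directly at the missing values left in `X`.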


#### Question 4

**4.** Remove every row of **num_vars** that contains a missing value in any column (as covered in the previous video). Store the resulting data in **all_rm**.

In [60]:
all_rm = num_vars.dropna()# dataframe with rows for any nan column removed

all_rm.head()
all_rm.shape


(2147, 5)

In [61]:
t.all_rm_test(all_rm) #test

Nice job! That looks right.  The default is to drop any row with a missing value in any column, so we didn't need to specify any arguments in this case.
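To illustrate the default mentioned above, here is a minimal sketch on hypothetical toy data that contrasts `dropna()` (equivalent to `how='any'`) with `how='all'`, which drops a row only when every value in it is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 2 is complete, row 3 is entirely missing.
toy = pd.DataFrame({
    'Salary': [np.nan, 100000.0, 82500.0, np.nan],
    'HoursPerWeek': [40.0, np.nan, 45.0, np.nan],
})

# Default behaviour (how='any'): drop rows with at least one missing value.
any_rm = toy.dropna()

# how='all': drop only the rows in which every value is missing.
all_missing_rm = toy.dropna(how='all')

print(any_rm.shape, all_missing_rm.shape)
```

Only one row survives the default drop, while `how='all'` keeps three, which shows how aggressive the default is when missing values are scattered across columns.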


#### Question 5

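**5.** Extract all the numeric variables in **all_rm** other than Salary and store them in **X_2**. Store the Salary column to predict in **y_2**. After splitting the data, run the code below and, based on the result, match the correct letter to the statement in **question5_solution**.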

In [62]:
X_2 = all_rm[['CareerSatisfaction', 'HoursPerWeek', 'JobSatisfaction', 'StackOverflowSatisfaction']]  # Create X_2 using explanatory variables from all_rm
y_2 = all_rm['Salary']  # Create y_2 using Salary from all_rm

# Split data into training and test data, and fit a linear model
X_2_train, X_2_test, y_2_train, y_2_test = train_test_split(X_2, y_2 , test_size=.30, random_state=42)
lm_2_model = LinearRegression(normalize=True)

# If our model works, it should just fit our model to the data. Otherwise, it will let us know.
try:
    lm_2_model.fit(X_2_train, y_2_train)
except:
    print("Oh no! It doesn't work!!!")

In [63]:
y_2_test.shape

(645,)

In [64]:
a = 'Python just likes to break sometimes for no reason at all.' 
b = 'It worked, because Python is magic.'
c = 'It broke because we still have missing values in X'

question5_solution = b

#test
t.question5_check(question5_solution)

Nice job! That's right! Python isn't exactly magic, but sometimes it feels like it is!


#### Question 6

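**6.** Now use **lm_2_model** to predict the test salaries, compare the predictions with **y_2_test**, and compute the R-squared value to assess how well the model predicts.

`r2_score()` computes the R-squared value, also called the coefficient of determination (a goodness-of-fit measure). It is the proportion of the total variation in the response variable that the regression on the explanatory variables accounts for.

Better model: r2 → 1

Worse model: r2 → 0 (on held-out data the score can even be negative when the model predicts worse than always guessing the mean)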
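The usual definition is R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The sketch below implements that definition by hand in NumPy; the helper `r_squared` and the toy values are hypothetical and only meant to show the three regimes (perfect, mean-only, worse than the mean):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared deviations from the mean
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])    # predictions match exactly
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
reversed_ = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than the mean

print(perfect, mean_only, reversed_)
```

The first case gives 1, the second gives 0, and the third is negative, which is why a test-set R² close to 0 means the model barely improves on predicting the average salary.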


In [65]:
y_test_preds = lm_2_model.predict(X_2_test)  # Predictions on X_2_test from lm_2_model
r2_test = r2_score(y_2_test, y_test_preds)  # R-squared comparing y_2_test with the predictions
# Print r2 to see result
r2_test

0.019170661803761924

In [66]:
t.r2_test_check(r2_test)

Nice job! That's right! Your rsquared matches the solution.


#### Question 7

**7.** Use what you have learned so far to match the letters below with the corresponding statements.

In [74]:
a = 5009
b = 'Other'
c = 645
d = 'We still want to predict their salary'
e = 'We do not care to predict their salary'
f = False
g = True

question7_solution = {'The number of reported salaries in the original dataset': a,
                       'The number of test salaries predicted using our model': c,
                       'If an individual does not rate stackoverflow, but has a salary': d,
                       'If an individual does not have a a job satisfaction, but has a salary': d,
                       'Our model predicts salaries for the two individuals described above.': f}
                      
                      
#Check your answers against the solution - you should be told you were right if your answers are correct!                     
t.question7_check(question7_solution)

Nice job! That looks right to me!  We would really like to predict for anyone who provides a salary, but our model right now definitely has some limitations.
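The limitation can be put in numbers using the counts that appear in this notebook. The sketch below is plain arithmetic on those counts; it assumes that `train_test_split` rounds the test-set size up, which is consistent with the 645 test rows observed above:

```python
import math

n_total = 19102     # rows in num_vars (from num_vars.shape)
n_salary = 5009     # rows with a reported salary (Question 7, answer a)
n_complete = 2147   # rows with no missing values (from all_rm.shape)
test_size = 0.30    # test fraction passed to train_test_split

# Share of respondents who reported a salary (matches Question 1).
share_with_salary = n_salary / n_total

# Share of salaried respondents that survive dropping every incomplete row.
share_usable = n_complete / n_salary

# Assumed split rule: test count rounded up, the rest used for training.
n_test = math.ceil(test_size * n_complete)
n_train = n_complete - n_test

print(round(share_with_salary, 3), round(share_usable, 3), n_test, n_train)
```

Fewer than half of the respondents who reported a salary remain usable after dropping every incomplete row, so the model can neither learn from nor predict for the majority of them.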

