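## Background and data description
This dataset contains anonymized course enrollment records. Each record gives the student's level (year of study) when taking the course, the course code, and the grade. Each student's major is also provided.
The goal is to predict a student's major from the enrollment records alone (without using the major column).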

## Loading, inspecting, and preprocessing the data

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt

from sklearn.naive_bayes import BernoulliNB
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
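def encode_text_dummy(df, name):
    # One-hot encode column `name` in place: add one "name-value" column per category, then drop the original column.
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)
def encode_text_index(df, name):
    # Replace column `name` with integer labels in place and return the original class names.
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_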
%matplotlib inline

In [2]:
df = pd.read_csv("training.csv",sep="|")
df.head(10)

Unnamed: 0,student_id,level,course,grade,major
0,ppVGBRKhtqqyxnVO,Freshman,SPAN:100,A,Business
1,PiPkSgMGbFIu5RwR,Freshman,CSI:160,S,International Relations
2,PiPkSgMGbFIu5RwR,Sophomore,EES:107,C,International Relations
3,PiPkSgMGbFIu5RwR,Senior,SPAN:201,B,International Relations
4,PiPkSgMGbFIu5RwR,Junior,ENTR:200,B+,International Relations
5,PiPkSgMGbFIu5RwR,Sophomore,SPAN:140,D+,International Relations
6,PiPkSgMGbFIu5RwR,Freshman,POLI:140,C,International Relations
7,PiPkSgMGbFIu5RwR,Sophomore,KORE:210,A,International Relations
8,PiPkSgMGbFIu5RwR,Sophomore,SPAN:309,A,International Relations
9,PiPkSgMGbFIu5RwR,Sophomore,ENTR:134,B,International Relations


In [3]:
df.shape

(97276, 5)

In [4]:
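df.describe()  # Inspect the data: there are many distinct courses and many grade values, so one-hot encoding everything looks unsuitable at first glance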

Unnamed: 0,student_id,level,course,grade,major
count,97276,97276,97276,97276,97276
unique,10000,4,2721,21,81
top,cJdq8OlOg2x4drwW,Senior,RHET:103,A,Psychology
freq,90,30367,1685,18729,10274


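There are 2721 distinct course codes and 21 distinct grades, so one-hot encoding them would produce a very high-dimensional feature space.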
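
To make the dimensionality concern concrete: the width of a one-hot encoding equals the total number of distinct values across the encoded columns. The following is a minimal sketch on a made-up frame (`toy` is hypothetical and much smaller than the real data).

```python
import pandas as pd

# Hypothetical toy records; the real data has 2721 courses and 21 grades.
toy = pd.DataFrame({
    "course": ["SPAN:100", "CSI:160", "EES:107", "SPAN:100"],
    "grade": ["A", "S", "C", "B"],
})

# One-hot encoding creates one column per distinct value,
# so the encoded width is the sum of the per-column cardinalities.
cardinality = toy.nunique()
encoded = pd.get_dummies(toy)

print(cardinality.to_dict())
print(encoded.shape[1])
```

Applied to the real columns, the same reasoning gives 2721 + 21 indicator columns for `course` and `grade` alone.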

In [5]:
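# The counts are very uneven: many students have only one recorded course while others have many,
# so I decided not to do feature selection on individual courses.
df.groupby(by = "student_id")["student_id"].value_counts().head(10)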

student_id        student_id      
01DiJuoJAB395ucJ  01DiJuoJAB395ucJ     1
01MhxeQl5FhRsf3f  01MhxeQl5FhRsf3f     1
01W7KB8TDNWNx4YW  01W7KB8TDNWNx4YW    14
042Rmpv5B2kXdfBR  042Rmpv5B2kXdfBR     1
04DuzbneGqk0o0jY  04DuzbneGqk0o0jY     1
04TIITMjjPIVIkES  04TIITMjjPIVIkES     1
04vxIYe6guefIhGD  04vxIYe6guefIhGD    37
04yPdcfnDzHbSIsS  04yPdcfnDzHbSIsS    37
04zKgh2DJS9owZNA  04zKgh2DJS9owZNA     1
059Ssc6DVmDrBM7o  059Ssc6DVmDrBM7o    49
Name: student_id, dtype: int64

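### Observations
Using `value_counts` and similar functions (the exploration is omitted here), I observed that the data is imbalanced and long-tailed: many courses were recorded for only one student, and many students have only one recorded course. Predicting from individual courses is therefore problematic, especially since no test set is available yet: many students in the test set may have taken courses that never appear in the training set (this also happens in practice, for example when new data contains previously unseen courses). I therefore decided to also try the department prefix of the course code as a feature.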
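
The long-tail check and the unseen-course risk described above can be sketched roughly as follows. The data, the hand-picked split, and the helper `department` are all hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical enrollment records.
toy = pd.DataFrame({
    "student_id": ["s1", "s1", "s1", "s2", "s3", "s4"],
    "course": ["PSY:100", "PSY:210", "RHET:103", "RHET:103", "KORE:210", "PSY:305"],
})

# Share of courses recorded only once, and share of students with a single record.
course_counts = toy["course"].value_counts()
student_counts = toy["student_id"].value_counts()
rare_course_share = (course_counts == 1).mean()
single_record_share = (student_counts == 1).mean()


def department(courses):
    """Return the department prefix of each course code (the text before ':')."""
    return courses.str.split(":").str[0]


# Hand-picked split: which test records use a course or department unseen in training?
train, test = toy.iloc[:4], toy.iloc[4:]
unseen_course = ~test["course"].isin(train["course"])
unseen_dept = ~department(test["course"]).isin(department(train["course"]))

print(rare_course_share, single_record_share)
print(unseen_course.mean(), unseen_dept.mean())
```

In this toy split every test course is new, while only one of the two test departments is, which is the motivation for falling back to the department prefix.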

In [6]:
df["Courese_DPT"] = df["course"].apply(lambda x: x.split(sep=":")[0])  # Use the department part of the course code as a feature

In [7]:
df["Course_Num"] = df["course"].apply(lambda x: x.split(sep=":")[1])  # Use the numeric part of the course code as a feature

In [8]:
df.head(5)  # What the data looks like now

Unnamed: 0,student_id,level,course,grade,major,Courese_DPT,Course_Num
0,ppVGBRKhtqqyxnVO,Freshman,SPAN:100,A,Business,SPAN,100
1,PiPkSgMGbFIu5RwR,Freshman,CSI:160,S,International Relations,CSI,160
2,PiPkSgMGbFIu5RwR,Sophomore,EES:107,C,International Relations,EES,107
3,PiPkSgMGbFIu5RwR,Senior,SPAN:201,B,International Relations,SPAN,201
4,PiPkSgMGbFIu5RwR,Junior,ENTR:200,B+,International Relations,ENTR,200


In [9]:
df.describe()  # There are only 182 departments, which greatly reduces the dimensionality; new departments are also unlikely to appear in the test set

Unnamed: 0,student_id,level,course,grade,major,Courese_DPT,Course_Num
count,97276,97276,97276,97276,97276,97276,97276
unique,10000,4,2721,21,81,182,362
top,cJdq8OlOg2x4drwW,Senior,RHET:103,A,Psychology,PSY,100
freq,90,30367,1685,18729,10274,4373,6591


Because neither course codes nor departments have a natural order, the categorical features are one-hot encoded, which yields a sparse (mostly zero) matrix.
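
As an alternative to the `pd.get_dummies` approach used in this notebook, scikit-learn's `OneHotEncoder` returns a true SciPy sparse matrix and can tolerate categories it has not seen before. The sketch below uses made-up department codes and is not what the notebook itself does.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical department codes; the notebook itself uses pd.get_dummies.
train_dept = np.array([["SPAN"], ["CSI"], ["EES"], ["SPAN"]])
test_dept = np.array([["CSI"], ["XYZ"]])  # "XYZ" never appears in training

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")
x_train_sparse = encoder.fit_transform(train_dept)  # SciPy sparse matrix
x_test_sparse = encoder.transform(test_dept)

print(encoder.categories_[0])
print(x_train_sparse.shape)
print(x_test_sparse.toarray())
```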

## Training
### 1. Predicting the major from a single enrollment record
Using the course code

In [10]:
encode_text_dummy(df,"course")

In [11]:
df.head(5)

Unnamed: 0,student_id,level,grade,major,Courese_DPT,Course_Num,course-006:100,course-ABRD:301,course-ABRD:302,course-ABRD:303,...,course-WRIT:100,course-WRIT:140,course-WRIT:160,course-WRIT:310,course-WRIT:326,course-WRIT:374,course-WRIT:390,course-WRIT:400,course-WRIT:474,course-WRIT:476
0,ppVGBRKhtqqyxnVO,Freshman,A,Business,SPAN,100,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,PiPkSgMGbFIu5RwR,Freshman,S,International Relations,CSI,160,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,PiPkSgMGbFIu5RwR,Sophomore,C,International Relations,EES,107,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,PiPkSgMGbFIu5RwR,Senior,B,International Relations,SPAN,201,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,PiPkSgMGbFIu5RwR,Junior,B+,International Relations,ENTR,200,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
le = preprocessing.LabelEncoder()
df_y = le.fit_transform(df["major"])

# The target is the major; LabelEncoder maps each major to an integer so that it can be fed to a classification model

In [13]:
df_x = df.copy()
for column in df.columns[:6]:
    df_x.drop(column,axis=1,inplace = True)
# df_x is now the sparse one-hot matrix used for training

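Because the features are very simple (high-dimensional, but without collinearity), no dimensionality reduction is needed, and the simplest model, naive Bayes, is sufficient.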

In [14]:
x_train, x_test, y_train, y_test = train_test_split(    
    df_x, df_y, test_size=0.25, random_state=42)
NB = BernoulliNB()
NB.fit(x_train,y_train)
y_pred=NB.predict(x_test)

In [15]:
metrics.accuracy_score(y_pred,y_test)  # Predicting a student's major from the code of a single course is not very accurate, but it beats always guessing the most frequent major

0.3780994284304453
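
For reference, the "always guess the most frequent major" baseline can be computed with scikit-learn's `DummyClassifier`. The labels below are hypothetical; on the real records, the `describe()` output above (Psychology, 10274 of 97276 rows) puts this baseline at roughly 10.6%.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: class 0 is the most frequent (6 of 10 records).
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
x_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(x_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(x_toy))
print(baseline_acc)
```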

Using the course department: predict the major only from which department the course belongs to

In [16]:
df_x = pd.get_dummies(df["Courese_DPT"])
x_train, x_test, y_train, y_test = train_test_split(    
    df_x, df_y, test_size=0.25, random_state=42)

In [17]:
NB = BernoulliNB()
NB.fit(x_train,y_train)
y_pred=NB.predict(x_test)
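metrics.accuracy_score(y_pred,y_test)  # Predicting the major from the department of a single course is not very accurate either, but it is close to the result obtained with more than 2700 course dimensions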

0.3341831489781652

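### 2. Predicting from all of a student's enrollment records
The models above used only the course code or department of a single record and ignored the repeated `student_id` values in the data. All courses taken by a student can instead be aggregated into one feature vector per student.
Using the course code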
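
The per-student aggregation can be sketched compactly with `pd.crosstab`, which gives the same counts as one-hot encoding every record and summing per student. The records below are hypothetical.

```python
import pandas as pd

# Hypothetical enrollment records.
toy = pd.DataFrame({
    "student_id": ["s1", "s1", "s1", "s2", "s2"],
    "course": ["PSY:100", "PSY:210", "RHET:103", "RHET:103", "KORE:210"],
})

# Rows are students, columns are courses, cells count enrollments.
per_student = pd.crosstab(toy["student_id"], toy["course"])

# Equivalent route: one-hot encode each record, then sum per student.
via_dummies = pd.get_dummies(toy["course"]).groupby(toy["student_id"]).sum()

print(per_student)
```

Note that `crosstab` orders students alphabetically, whereas the notebook's `groupby(..., sort=False)` keeps first-appearance order; the labels must be built in the same order as the features.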

In [18]:
df_x = df.copy()
for column in df.columns[1:6]:
    df_x.drop(column,axis=1,inplace = True)
df_x.head(5)   

Unnamed: 0,student_id,course-006:100,course-ABRD:301,course-ABRD:302,course-ABRD:303,course-ABRD:304,course-ABRD:306,course-ABRD:307,course-ABRD:308,course-ABRD:309,...,course-WRIT:100,course-WRIT:140,course-WRIT:160,course-WRIT:310,course-WRIT:326,course-WRIT:374,course-WRIT:390,course-WRIT:400,course-WRIT:474,course-WRIT:476
0,ppVGBRKhtqqyxnVO,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,PiPkSgMGbFIu5RwR,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,PiPkSgMGbFIu5RwR,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,PiPkSgMGbFIu5RwR,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,PiPkSgMGbFIu5RwR,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
student = df_x.groupby(by = "student_id",sort =False)
df_x_ = student.sum()
df_x_.head(10)

Unnamed: 0_level_0,course-006:100,course-ABRD:301,course-ABRD:302,course-ABRD:303,course-ABRD:304,course-ABRD:306,course-ABRD:307,course-ABRD:308,course-ABRD:309,course-ABRD:311,...,course-WRIT:100,course-WRIT:140,course-WRIT:160,course-WRIT:310,course-WRIT:326,course-WRIT:374,course-WRIT:390,course-WRIT:400,course-WRIT:474,course-WRIT:476
student_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ppVGBRKhtqqyxnVO,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PiPkSgMGbFIu5RwR,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
x9wNE71Wzj7cIiGV,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Ilym6N264yshysce,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
kO4XKESuy7bn0XLg,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5XHKpFQzdvsSNArX,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
F1G3okZqalg3hriD,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
iBjaatchfqc0CuDg,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Jmq59W8bu3EZoelN,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
EWg2FxS2c4PGuKp9,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [20]:
df_x_.shape  # Indeed 10,000 students

(10000, 2721)

In [21]:
df_y_ = df.groupby(by = "student_id",sort = False).head(1)["major"]
df_y_.head(10)  # Each student's major

0                      Business
1       International Relations
13          Medical Engineering
14                   Psychology
46                   Psychology
54    Interdepartmental Studies
68                     Business
69                      English
70    Health And Sport Sciences
71                     Medicine
Name: major, dtype: object

In [22]:
y = le.fit_transform(df_y_)  # Convert the majors to integer labels

x_train, x_test, y_train, y_test = train_test_split(    
    df_x_, y, test_size=0.1, random_state=42)
NB2 = BernoulliNB()
NB2.fit(x_train,y_train)
y_pred=NB2.predict(x_test)
metrics.accuracy_score(y_pred,y_test)  # Naive Bayes

0.765

In [23]:
x_train, x_test, y_train, y_test = train_test_split(    
    df_x_, y, test_size=0.25, random_state=42)
lr = LogisticRegression()
lr.fit(x_train,y_train)
y_pred=lr.predict(x_test)
metrics.accuracy_score(y_pred,y_test)  # Logistic regression

0.9212

In [24]:
x_train, x_test, y_train, y_test = train_test_split(    
    df_x_, y, test_size=0.1, random_state=42)
lr = LogisticRegression()
lr.fit(x_train,y_train)
y_pred=lr.predict(x_test)
metrics.accuracy_score(y_pred,y_test)

0.919

In [26]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

In [27]:
lda_clf = LDA()
lda_clf.fit(x_train,y_train)
y_pred=lda_clf.predict(x_test)
metrics.accuracy_score(y_pred,y_test)  # Linear discriminant analysis



0.742

In [28]:
pca = PCA(n_components=100)  # Reduce dimensionality with PCA
pca.fit(x_train)
x_train = pca.transform(x_train)
x_test = pca.transform(x_test)
lda_clf = LDA()
lda_clf.fit(x_train,y_train)
y_pred=lda_clf.predict(x_test)
metrics.accuracy_score(y_pred,y_test)  # Linear discriminant analysis: after dimensionality reduction the accuracy improves and it also runs faster

0.824
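
The cell above overwrites `x_train` and `x_test` with their PCA projections, so later cells silently work on reduced data. One way to avoid that, sketched here on synthetic data rather than the notebook's features, is to chain PCA and LDA in a scikit-learn pipeline. All names and sizes below are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the per-student feature matrix (hypothetical data).
x_demo, y_demo = make_classification(
    n_samples=300, n_features=40, n_informative=10, n_classes=3, random_state=0
)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0
)

# PCA is fitted on the training split only and reapplied automatically
# at predict time; x_tr and x_te themselves are left untouched.
pca_lda = make_pipeline(PCA(n_components=10), LinearDiscriminantAnalysis())
pca_lda.fit(x_tr, y_tr)
print(pca_lda.score(x_te, y_te))
```

A pipeline also means only one object has to be saved for later prediction instead of separate PCA and LDA files.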

In [29]:
lr = LogisticRegression()  # Logistic regression on the PCA-reduced data
lr.fit(x_train,y_train)
y_pred=lr.predict(x_test)
metrics.accuracy_score(y_pred,y_test)

0.898

Using the course departments: count how many courses each student took in each department

In [30]:
x_DPT = df[["student_id","Courese_DPT"]].copy()
encode_text_dummy(x_DPT,"Courese_DPT")
x_DPT = x_DPT.groupby(by = "student_id",sort = False).sum()

In [31]:
x_DPT.shape

(10000, 182)

In [32]:
x_train, x_test, y_train, y_test = train_test_split(    
    x_DPT, y, test_size=0.25, random_state=43)
lr1 = LogisticRegression()
lr1.fit(x_train,y_train)
y_pred=lr1.predict(x_test)
metrics.accuracy_score(y_pred,y_test)  # Logistic regression

0.9088

In [33]:
x_train, x_test, y_train, y_test = train_test_split(    
    x_DPT, y, test_size=0.1, random_state=43)
lr1 = LogisticRegression()
lr1.fit(x_train,y_train)
y_pred=lr1.predict(x_test)
metrics.accuracy_score(y_pred,y_test)  # Trained on a larger share of the data

0.927

In [34]:
x_train, x_test, y_train, y_test = train_test_split(    
    x_DPT, y, test_size=0.1)  # random_state is left unset, to check whether the test accuracy above is representative
lr1 = LogisticRegression()
lr1.fit(x_train,y_train)
y_pred=lr1.predict(x_test)
metrics.accuracy_score(y_pred,y_test) 

0.881
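
The three single-split scores above (0.9088, 0.927, 0.881) show how much the estimate depends on the split. K-fold cross-validation gives a mean and a spread instead; the sketch below uses synthetic data as a stand-in, and the fold count and `max_iter` value are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the department-count features (hypothetical data).
x_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=3, random_state=0
)

# Five different train/test partitions give a mean and a spread,
# rather than a single number that depends on one random split.
scores = cross_val_score(LogisticRegression(max_iter=1000), x_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```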

In [35]:
NB = BernoulliNB()
NB.fit(x_train,y_train)
y_pred=NB.predict(x_test)
metrics.accuracy_score(y_pred,y_test)  # Naive Bayes

0.808

In [36]:
from sklearn.ensemble import RandomForestClassifier
rf1 = RandomForestClassifier(n_estimators=100,max_features=20)
rf1.fit(x_train,y_train)
y_pred=rf1.predict(x_test)
metrics.accuracy_score(y_pred,y_test)  # Some students take many courses and others few, so the variance may be large; try a random forest



0.843

In [37]:
y_pred=rf1.predict(x_train)
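metrics.accuracy_score(y_pred,y_train)  # Check the training accuracy: it is much higher than the test accuracy, so overfitting is fairly severe. My guess is that class imbalance causes this; see
#  https://zhuanlan.zhihu.com/p/43997117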

0.9624444444444444

In [38]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(x_train, y_train)
y_pred=neigh.predict(x_test)
metrics.accuracy_score(y_pred,y_test)  # K-nearest neighbors

0.869

In [39]:
lda_clf = LDA()
lda_clf.fit(x_train,y_train)
y_pred=lda_clf.predict(x_test)
metrics.accuracy_score(y_pred,y_test)  # Linear discriminant analysis

0.782

In [40]:
pca = PCA(n_components=50)  # Reduce dimensionality with PCA
pca.fit(x_train)
x_train = pca.transform(x_train)
x_test = pca.transform(x_test)
lda_clf = LDA()
lda_clf.fit(x_train,y_train)
y_pred=lda_clf.predict(x_test)
metrics.accuracy_score(y_pred,y_test)  # Linear discriminant analysis: after dimensionality reduction the accuracy improves slightly and it also runs faster

0.79

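## Predicting on the test set
I finally chose the per-student department counts as features, for two reasons: (1) low dimensionality and (2) better generalization.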

In [41]:

LR = LogisticRegression()
LR.fit(x_DPT, y)
import joblib  # sklearn.externals.joblib has been removed from recent scikit-learn releases
joblib.dump(LR, "train_model.m")

['train_model.m']

In [42]:
# Save the feature columns so that exactly the same mapping can be applied to the test set later
columns = list(pd.get_dummies(df["Courese_DPT"]).columns)
with open('DPT.txt', mode='w') as file_handle:
    for column in columns:
        file_handle.write(column)
        file_handle.write('\n')

In [43]:
pd.get_dummies(df["Courese_DPT"]).columns

Index(['006', 'ABRD', 'ACB', 'ACCT', 'ACTS', 'AERO', 'AFAM', 'AINS', 'AMST',
       'ANIM',
       ...
       'THTR', 'TR', 'TRNS', 'UHSG', 'UICB', 'ULIB', 'URES', 'URP', 'WLLC',
       'WRIT'],
      dtype='object', length=182)

In [44]:
joblib.dump(le, "LabelEncoder.m")

['LabelEncoder.m']

In [45]:
joblib.dump(lda_clf,"LDA.m")
joblib.dump(pca,"PCA.m")

['PCA.m']
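
The cells above write the model, the label encoder, and the column list to separate files. A possible alternative, shown here only as a sketch with made-up data, is to store them together in one `joblib` file so they cannot get out of sync. The file name, the dictionary keys, and the "Spanish" major are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Hypothetical tiny training set: two department-count features, two majors.
feature_columns = ["PSY", "SPAN"]
x_small = np.array([[3, 0], [2, 1], [0, 4], [1, 3]])
majors = ["Psychology", "Psychology", "Spanish", "Spanish"]

label_encoder = LabelEncoder()
model = LogisticRegression().fit(x_small, label_encoder.fit_transform(majors))

# Store everything needed at prediction time in a single file.
bundle = {"model": model, "label_encoder": label_encoder, "columns": feature_columns}
path = os.path.join(tempfile.mkdtemp(), "bundle.joblib")
joblib.dump(bundle, path)

# Reload and map the integer predictions back to major names.
restored = joblib.load(path)
pred = restored["label_encoder"].inverse_transform(restored["model"].predict(x_small))
print(list(pred))
```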

### To predict on the test set: load the model, apply the same transformation to the test data, and predict to obtain a mapping from student ID to major

In [0]:
clf = joblib.load("train_model.m")  # Load the trained model
le = joblib.load("LabelEncoder.m")
test_csv = "training.csv"  # File name of the test set; change it if yours is different
test_df = pd.read_csv(test_csv,sep="|")  # Read the data

# Read back the department columns saved at training time, one per line
with open('DPT.txt', mode='r') as file_handler:
    columns = [line.strip('\n') for line in file_handler]

In [0]:
test_df["Courese_DPT"] = test_df["course"].apply(lambda x: x.split(sep=":")[0])  # Use the department part of the course code as a feature

for column in columns:
    test_df[column] = test_df["Courese_DPT"].apply(lambda x:1 if x==column else 0)
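
The per-column loop above can also be written as a one-hot encoding followed by `reindex`, which forces the saved training column order. The sketch below uses a hypothetical column list and test values, including a department that was never seen in training.

```python
import pandas as pd

# Hypothetical saved training columns and test departments ("XYZ" is unseen).
train_columns = ["CSI", "EES", "SPAN"]
test_dept = pd.Series(["SPAN", "XYZ", "CSI"], name="Courese_DPT")

# One-hot encode, then force the training column order:
# missing departments become 0 and unseen ones ("XYZ") are dropped.
aligned = (
    pd.get_dummies(test_dept)
    .reindex(columns=train_columns, fill_value=0)
    .astype(int)
)
print(aligned)
```

Either way, the model receives exactly the columns it was trained on, in the same order.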

In [0]:
test_df.head(5)

Unnamed: 0,student_id,level,course,grade,major,Courese_DPT,006,ABRD,ACB,ACCT,...,THTR,TR,TRNS,UHSG,UICB,ULIB,URES,URP,WLLC,WRIT
0,ppVGBRKhtqqyxnVO,Freshman,SPAN:100,A,Business,SPAN,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,PiPkSgMGbFIu5RwR,Freshman,CSI:160,S,International Relations,CSI,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,PiPkSgMGbFIu5RwR,Sophomore,EES:107,C,International Relations,EES,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,PiPkSgMGbFIu5RwR,Senior,SPAN:201,B,International Relations,SPAN,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,PiPkSgMGbFIu5RwR,Junior,ENTR:200,B+,International Relations,ENTR,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [0]:
# Keep only student_id and the department indicator columns. Selecting by name
# (rather than by position) also works when the test file has no "major" column.
x = test_df[["student_id"] + columns]
x = x.groupby(by ="student_id",as_index =False,sort = False).sum()
ids = x["student_id"]
x = x.drop("student_id",axis =1)

In [0]:
x.head(1)

Unnamed: 0,006,ABRD,ACB,ACCT,ACTS,AERO,AFAM,AINS,AMST,ANIM,...,THTR,TR,TRNS,UHSG,UICB,ULIB,URES,URP,WLLC,WRIT
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [0]:
y_pred = clf.predict(x) 
df = pd.DataFrame({"student_id":ids,"major":le.inverse_transform(y_pred)})  # Mapping from student ID to predicted major

