## Background and data set
This data set contains anonymized course registration records for students, including the course code, the grade, and the year level (Freshman to Senior) at which the student took the course. Each student's major is also provided and serves as the target.
The task is to predict students' majors from their registration history.

## Import the data and explore

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt

from sklearn.naive_bayes import BernoulliNB
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_    
%matplotlib inline

In [2]:
df = pd.read_csv("training.csv",sep="|")
df.head(10)

Unnamed: 0,student_id,level,course,grade,major
0,ppVGBRKhtqqyxnVO,Freshman,SPAN:100,A,Business
1,PiPkSgMGbFIu5RwR,Freshman,CSI:160,S,International Relations
2,PiPkSgMGbFIu5RwR,Sophomore,EES:107,C,International Relations
3,PiPkSgMGbFIu5RwR,Senior,SPAN:201,B,International Relations
4,PiPkSgMGbFIu5RwR,Junior,ENTR:200,B+,International Relations
5,PiPkSgMGbFIu5RwR,Sophomore,SPAN:140,D+,International Relations
6,PiPkSgMGbFIu5RwR,Freshman,POLI:140,C,International Relations
7,PiPkSgMGbFIu5RwR,Sophomore,KORE:210,A,International Relations
8,PiPkSgMGbFIu5RwR,Sophomore,SPAN:309,A,International Relations
9,PiPkSgMGbFIu5RwR,Sophomore,ENTR:134,B,International Relations


In [3]:
df.shape

(97276, 5)

In [4]:
df.describe()
# There are many distinct courses and grades. Although these are categorical features,
# one-hot encoding them directly would lead to high dimensionality.

Unnamed: 0,student_id,level,course,grade,major
count,97276,97276,97276,97276,97276
unique,10000,4,2721,21,81
top,cJdq8OlOg2x4drwW,Senior,RHET:103,A,Psychology
freq,90,30367,1685,18729,10274


According to the summary, there are many distinct courses (2721) and grades (21). Although these are categorical features, one-hot encoding them directly is not appropriate because it would lead to very high dimensionality.
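
The dimensionality concern can be checked directly: one-hot encoding produces one column per distinct value. The sketch below uses a few made-up rows in the same `DEPT:NUMBER` format (the toy values are assumptions, not rows from `training.csv`); on the real data the same call would yield 2721 columns for `course`.

```python
import pandas as pd

# Toy rows in the same "DEPT:NUMBER" format as the course column (illustrative only).
toy = pd.DataFrame({
    "course": ["SPAN:100", "SPAN:201", "CSI:160", "EES:107", "SPAN:100"],
    "grade": ["A", "B", "S", "C", "A"],
})

# One-hot encoding creates one column per distinct value, so the width of
# the encoded matrix equals the number of unique values in the column.
n_unique = toy["course"].nunique()
encoded = pd.get_dummies(toy["course"])

print(n_unique, encoded.shape)
```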

In [5]:
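# The counts are very uneven: many students have only one course recorded, while others have many,
# so I decided not to do feature selection at the course level.
df.groupby(by = "student_id")["student_id"].value_counts().head(10)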

student_id        student_id      
01DiJuoJAB395ucJ  01DiJuoJAB395ucJ     1
01MhxeQl5FhRsf3f  01MhxeQl5FhRsf3f     1
01W7KB8TDNWNx4YW  01W7KB8TDNWNx4YW    14
042Rmpv5B2kXdfBR  042Rmpv5B2kXdfBR     1
04DuzbneGqk0o0jY  04DuzbneGqk0o0jY     1
04TIITMjjPIVIkES  04TIITMjjPIVIkES     1
04vxIYe6guefIhGD  04vxIYe6guefIhGD    37
04yPdcfnDzHbSIsS  04yPdcfnDzHbSIsS    37
04zKgh2DJS9owZNA  04zKgh2DJS9owZNA     1
059Ssc6DVmDrBM7o  059Ssc6DVmDrBM7o    49
Name: student_id, dtype: int64

### Analysis on the data
Using functions such as value_counts, we can see that the data distribution is unbalanced and has a long tail. Using the full course codes as features would therefore cause problems: for example, many courses were selected by only one student, so a model built on them may generalize poorly.

I therefore decided to use the course's department as the feature, which is more general and more evenly distributed.
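
A small sketch of how the long tail could be measured and how the department code removes it. The rows are invented for illustration, and the column name `dept` is an assumption used only in this example.

```python
import pandas as pd

# Invented registration rows (not taken from training.csv).
toy = pd.DataFrame({
    "student_id": ["s1", "s1", "s2", "s3", "s3", "s4"],
    "course": ["SPAN:100", "CSI:160", "SPAN:100", "SPAN:201", "EES:107", "SPAN:100"],
})

# Distinct students per course; courses with a count of 1 form the long tail.
students_per_course = toy.groupby("course")["student_id"].nunique()
singletons = int((students_per_course == 1).sum())

# The department code is the part of the course code before the colon.
toy["dept"] = toy["course"].str.split(":").str[0]
students_per_dept = toy.groupby("dept")["student_id"].nunique()

print(singletons)
print(students_per_dept)
```

Grouping by department merges the rarely taken courses of one department into a single, better populated feature.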

In [6]:
df["Courese_DPT"] = df["course"].apply(lambda x: x.split(sep=":")[0])#Set the department code in course code as a feature

In [7]:
df["Course_Num"] = df["course"].apply(lambda x: x.split(sep=":")[1])#Set the number code in course code as a feature

In [8]:
df.head(5)#Show the data

Unnamed: 0,student_id,level,course,grade,major,Courese_DPT,Course_Num
0,ppVGBRKhtqqyxnVO,Freshman,SPAN:100,A,Business,SPAN,100
1,PiPkSgMGbFIu5RwR,Freshman,CSI:160,S,International Relations,CSI,160
2,PiPkSgMGbFIu5RwR,Sophomore,EES:107,C,International Relations,EES,107
3,PiPkSgMGbFIu5RwR,Senior,SPAN:201,B,International Relations,SPAN,201
4,PiPkSgMGbFIu5RwR,Junior,ENTR:200,B+,International Relations,ENTR,200


In [9]:
df.describe() # There are only 182 departments,
# so the dimensionality is much lower and the feature should generalize better

Unnamed: 0,student_id,level,course,grade,major,Courese_DPT,Course_Num
count,97276,97276,97276,97276,97276,97276,97276
unique,10000,4,2721,21,81,182,362
top,cJdq8OlOg2x4drwW,Senior,RHET:103,A,Psychology,PSY,100
freq,90,30367,1685,18729,10274,4373,6591


One-hot encode the department, since it is a categorical feature with no inherent order.

## Training
### 1. Predicting the major from a single record
Use the full course code (high-dimensional) as the feature.

In [10]:
encode_text_dummy(df,"course")

In [11]:
df.head(5)

Unnamed: 0,student_id,level,grade,major,Courese_DPT,Course_Num,course-006:100,course-ABRD:301,course-ABRD:302,course-ABRD:303,...,course-WRIT:100,course-WRIT:140,course-WRIT:160,course-WRIT:310,course-WRIT:326,course-WRIT:374,course-WRIT:390,course-WRIT:400,course-WRIT:474,course-WRIT:476
0,ppVGBRKhtqqyxnVO,Freshman,A,Business,SPAN,100,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,PiPkSgMGbFIu5RwR,Freshman,S,International Relations,CSI,160,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,PiPkSgMGbFIu5RwR,Sophomore,C,International Relations,EES,107,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,PiPkSgMGbFIu5RwR,Senior,B,International Relations,SPAN,201,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,PiPkSgMGbFIu5RwR,Junior,B+,International Relations,ENTR,200,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
le = preprocessing.LabelEncoder()
df_y = le.fit_transform(df["major"])

# Use LabelEncoder to map majors to integer labels for training

In [13]:
# Drop the first six columns (student_id, level, grade, major, Courese_DPT, Course_Num),
# leaving only the sparse one-hot course matrix used as model input
df_x = df.drop(columns=df.columns[:6])

The features are simple and there is no correlation between them, so let's try Naive Bayes.

In [14]:
x_train, x_test, y_train, y_test = train_test_split(    
    df_x, df_y, test_size=0.25, random_state=42)
NB = BernoulliNB()
NB.fit(x_train,y_train)
y_pred=NB.predict(x_test)

In [15]:
metrics.accuracy_score(y_pred,y_test) # Not a good result, but better than predicting the most frequent major for every record

0.3780994284304453
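
To put this accuracy in context, the majority-class baseline can be derived from the counts in the `df.describe()` output. This is a back-of-the-envelope sketch computed on the full data set rather than on the exact test split, so the value for the split may differ slightly.

```python
# Counts taken from the df.describe() output shown earlier.
n_rows = 97276          # total registration records
top_major_freq = 10274  # records whose major is the most frequent one (Psychology)

# Accuracy of a classifier that always predicts the most frequent major.
baseline_accuracy = top_major_freq / n_rows
print(round(baseline_accuracy, 4))
```

The baseline is roughly 0.106, so the Naive Bayes score above is well over three times better than always guessing the mode.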

Use the department to predict the major.

In [16]:
df_x = pd.get_dummies(df["Courese_DPT"])
x_train, x_test, y_train, y_test = train_test_split(    
    df_x, df_y, test_size=0.25, random_state=42)

In [17]:
NB = BernoulliNB()
NB.fit(x_train,y_train)
y_pred=NB.predict(x_test)
metrics.accuracy_score(y_pred,y_test) # Close to the result with the full course code, but with 182 dimensions instead of 2721

0.3341831489781652

### 2. Use all records of one student to predict the major
So far we have not used student_id, which repeats across rows. We can use it to combine all the records of each student.
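
The idea of combining records per student can be sketched on toy data. This version uses `pd.crosstab` as an alternative to one-hot encoding followed by a grouped sum; the rows are invented, and note that `crosstab` sorts the student IDs, so the labels are reindexed to stay aligned with the feature rows.

```python
import pandas as pd

# Invented rows; majors are reused from the examples above purely for illustration.
toy = pd.DataFrame({
    "student_id": ["s2", "s1", "s2", "s1", "s3"],
    "course": ["SPAN:100", "CSI:160", "SPAN:201", "CSI:160", "SPAN:100"],
    "major": ["English", "Psychology", "English", "Psychology", "Business"],
})

# Rows: students, columns: courses, values: number of records per pair.
counts = pd.crosstab(toy["student_id"], toy["course"])

# One label per student, reindexed so features and labels share the same order.
labels = toy.groupby("student_id")["major"].first().reindex(counts.index)

print(counts)
print(labels)
```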

In [18]:
df_x = df.copy()
for column in df.columns[1:6]:
    df_x.drop(column,axis=1,inplace = True)
df_x.head(5)   

Unnamed: 0,student_id,course-006:100,course-ABRD:301,course-ABRD:302,course-ABRD:303,course-ABRD:304,course-ABRD:306,course-ABRD:307,course-ABRD:308,course-ABRD:309,...,course-WRIT:100,course-WRIT:140,course-WRIT:160,course-WRIT:310,course-WRIT:326,course-WRIT:374,course-WRIT:390,course-WRIT:400,course-WRIT:474,course-WRIT:476
0,ppVGBRKhtqqyxnVO,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,PiPkSgMGbFIu5RwR,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,PiPkSgMGbFIu5RwR,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,PiPkSgMGbFIu5RwR,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,PiPkSgMGbFIu5RwR,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
student = df_x.groupby(by = "student_id",sort =False)
df_x_ = student.sum()
df_x_.head(10)

student_id,course-006:100,course-ABRD:301,course-ABRD:302,course-ABRD:303,course-ABRD:304,course-ABRD:306,course-ABRD:307,course-ABRD:308,course-ABRD:309,course-ABRD:311,...,course-WRIT:100,course-WRIT:140,course-WRIT:160,course-WRIT:310,course-WRIT:326,course-WRIT:374,course-WRIT:390,course-WRIT:400,course-WRIT:474,course-WRIT:476
ppVGBRKhtqqyxnVO,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
PiPkSgMGbFIu5RwR,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
x9wNE71Wzj7cIiGV,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Ilym6N264yshysce,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
kO4XKESuy7bn0XLg,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5XHKpFQzdvsSNArX,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
F1G3okZqalg3hriD,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
iBjaatchfqc0CuDg,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Jmq59W8bu3EZoelN,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
EWg2FxS2c4PGuKp9,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [20]:
df_x_.shape # 10,000 students

(10000, 2721)

In [21]:
df_y_ = df.groupby(by = "student_id",sort = False).head(1)["major"]
df_y_.head(10) #major

0                      Business
1       International Relations
13          Medical Engineering
14                   Psychology
46                   Psychology
54    Interdepartmental Studies
68                     Business
69                      English
70    Health And Sport Sciences
71                     Medicine
Name: major, dtype: object

In [22]:
y = le.fit_transform(df_y_) #label encoding

x_train, x_test, y_train, y_test = train_test_split(    
    df_x_, y, test_size=0.1, random_state=42)
NB2 = BernoulliNB()
NB2.fit(x_train,y_train)
y_pred=NB2.predict(x_test)
metrics.accuracy_score(y_pred,y_test)  # Naive Bayes

0.765
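
One detail worth keeping in mind: the per-student matrix holds summed indicators, so a value can exceed 1 if a student has several records for the same course. `BernoulliNB` binarizes its input with the default `binarize=0.0`, so only presence or absence is used. The sketch below checks this on random synthetic counts (an assumption for illustration, not the notebook's data).

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X_counts = rng.integers(0, 4, size=(60, 8))  # synthetic counts in 0..3
y = rng.integers(0, 3, size=60)              # synthetic class labels

# Fit once on raw counts and once on an explicit 0/1 version.
nb_counts = BernoulliNB().fit(X_counts, y)
X_binary = (X_counts > 0).astype(int)
nb_binary = BernoulliNB().fit(X_binary, y)

# With the default binarize=0.0 both models see the same 0/1 matrix.
same = np.array_equal(nb_counts.predict(X_counts), nb_binary.predict(X_binary))
print(same)
```

If the repeat counts are expected to carry information, a count-based model such as `MultinomialNB` could be compared as well.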

In [23]:
x_train, x_test, y_train, y_test = train_test_split(    
    df_x_, y, test_size=0.25, random_state=42)
lr = LogisticRegression()
lr.fit(x_train,y_train)
y_pred=lr.predict(x_test)
metrics.accuracy_score(y_pred,y_test) # Logistic Regression

0.9212

In [24]:
x_train, x_test, y_train, y_test = train_test_split(    
    df_x_, y, test_size=0.1, random_state=42)
lr = LogisticRegression()
lr.fit(x_train,y_train)
y_pred=lr.predict(x_test)
metrics.accuracy_score(y_pred,y_test)

0.919

In [28]:
pca = PCA(n_components=100) # Reduce dimensionality using PCA
pca.fit(x_train)
x_train = pca.transform(x_train)
x_test = pca.transform(x_test)


0.824

In [29]:
lr = LogisticRegression() # Logistic Regression on reduced data
lr.fit(x_train,y_train)
y_pred=lr.predict(x_test)
metrics.accuracy_score(y_pred,y_test)

0.898
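
The PCA step followed by logistic regression can also be expressed as a single scikit-learn `Pipeline`, which guarantees that PCA is fitted on the training split only. This is a sketch on synthetic data; the sizes, `n_components=10`, and `max_iter=1000` are assumptions chosen for the example, not the notebook's settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
# Synthetic count features: 300 samples, 50 features, 3 classes.
y = rng.integers(0, 3, size=300)
X = rng.poisson(1.0, size=(300, 50)).astype(float)
X[np.arange(300), y] += 3.0  # give each class a signal in one of the first three features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# PCA is fitted inside the pipeline, using the training data only.
pipe = Pipeline([
    ("pca", PCA(n_components=10)),
    ("lr", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
accuracy = pipe.score(X_test, y_test)
print(accuracy)
```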

Using only the department code

In [30]:
x_DPT = df[["student_id","Courese_DPT"]].copy()
encode_text_dummy(x_DPT,"Courese_DPT")
x_DPT = x_DPT.groupby(by = "student_id",sort = False).sum()

In [31]:
x_DPT.shape

(10000, 182)

In [32]:
x_train, x_test, y_train, y_test = train_test_split(    
    x_DPT, y, test_size=0.25, random_state=43)
lr1 = LogisticRegression()
lr1.fit(x_train,y_train)
y_pred=lr1.predict(x_test)
metrics.accuracy_score(y_pred,y_test) # Logistic Regression

0.9088

In [33]:
x_train, x_test, y_train, y_test = train_test_split(    
    x_DPT, y, test_size=0.1, random_state=43)
lr1 = LogisticRegression()
lr1.fit(x_train,y_train)
y_pred=lr1.predict(x_test)
metrics.accuracy_score(y_pred,y_test) # More training data

0.927

In [34]:
x_train, x_test, y_train, y_test = train_test_split(    
    x_DPT, y, test_size=0.1) # no fixed random_state, so the split differs between runs
lr1 = LogisticRegression()
lr1.fit(x_train,y_train)
y_pred=lr1.predict(x_test)
metrics.accuracy_score(y_pred,y_test) 

0.881
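
The accuracy moves noticeably with the split (0.927 with one seed, 0.881 with an unseeded split), so a single hold-out number is a noisy estimate. A hedged sketch of k-fold cross-validation, which reports a mean and spread instead, is shown below on synthetic data; the fold count and data sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the per-student department counts.
y = rng.integers(0, 3, size=300)
X = rng.poisson(1.0, size=(300, 20)).astype(float)
X[np.arange(300), y] += 3.0  # add a class-dependent signal

# Five shuffled folds; each sample is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy")

print(scores.mean(), scores.std())
```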

In [35]:
NB = BernoulliNB()
NB.fit(x_train,y_train)
y_pred=NB.predict(x_test)
metrics.accuracy_score(y_pred,y_test) # Naive Bayes

0.808

In [36]:
from sklearn.ensemble import RandomForestClassifier
rf1 = RandomForestClassifier(n_estimators=100,max_features=20)
rf1.fit(x_train,y_train)
y_pred=rf1.predict(x_test)
metrics.accuracy_score(y_pred,y_test) # Random Forest


0.843

In [37]:
y_pred=rf1.predict(x_train)
metrics.accuracy_score(y_pred,y_train) # Training accuracy

0.9624444444444444

In [38]:
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=5)
neigh.fit(x_train, y_train)
y_pred=neigh.predict(x_test)
metrics.accuracy_score(y_pred,y_test) # KNN

0.869

## Predictions on test set
In the end I chose the department as the feature, for two reasons: (1) low dimensionality and (2) good generalization.

In [41]:

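import joblib  # sklearn.externals.joblib is no longer available in newer scikit-learn releases

# Train on all students before saving the model
LR = LogisticRegression()
LR.fit(x_DPT, y)
joblib.dump(LR, "train_model.m")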

['train_model.m']

In [42]:
# Save the one-hot column order used in training so the test set can be encoded the same way
columns = list(pd.get_dummies(df["Courese_DPT"]).columns)
with open('DPT.txt', mode='w') as file_handle:
    for column in columns:
        file_handle.write(column)
        file_handle.write('\n')

In [43]:
pd.get_dummies(df["Courese_DPT"]).columns

Index(['006', 'ABRD', 'ACB', 'ACCT', 'ACTS', 'AERO', 'AFAM', 'AINS', 'AMST',
       'ANIM',
       ...
       'THTR', 'TR', 'TRNS', 'UHSG', 'UICB', 'ULIB', 'URES', 'URP', 'WLLC',
       'WRIT'],
      dtype='object', length=182)

In [44]:
joblib.dump(le, "LabelEncoder.m")

['LabelEncoder.m']
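
A minimal round-trip sketch of the persistence step, assuming the standalone `joblib` package is installed (it ships as a scikit-learn dependency). The tiny model and the file name `model.joblib` are placeholders for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny illustrative training set.
X = np.array([[0, 1], [1, 0], [0, 2], [2, 0]], dtype=float)
y = np.array([0, 1, 0, 1])
model = LogisticRegression().fit(X, y)

# Dump to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)
```

Loading is only safe with the same (or a compatible) scikit-learn version that wrote the file, so it helps to record the version next to the saved model.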

In [45]:
joblib.dump(lda_clf,"LDA.m")  # lda_clf is not defined in the cells shown here
joblib.dump(pca,"PCA.m")

['PCA.m']

### In use: load the model, apply the same preprocessing to the test data, and map each student ID to a predicted major

In [0]:
clf = joblib.load("train_model.m") # load model
le = joblib.load("LabelEncoder.m")
test_csv = "training.csv"  #test set
test_df = pd.read_csv(test_csv,sep="|") #read data

# Read the saved department columns, one per line, in training order
with open('DPT.txt', mode='r') as file_handler:
    columns = [line.strip('\n') for line in file_handler]

In [0]:
test_df["Courese_DPT"] = test_df["course"].apply(lambda x: x.split(sep=":")[0])  # department code from the course code

# One indicator column per training department, in the saved order
for column in columns:
    test_df[column] = test_df["Courese_DPT"].apply(lambda x: 1 if x == column else 0)
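
An alternative sketch for aligning the test encoding with the training columns: encode with `pd.get_dummies` and then `reindex` to the saved column list. Departments unseen in training are dropped and missing ones are filled with zeros. The department `ZZZ` is a hypothetical unseen code used only for this example.

```python
import pandas as pd

train_dept = pd.Series(["SPAN", "CSI", "EES", "SPAN"])
test_dept = pd.Series(["CSI", "ZZZ", "SPAN"])  # "ZZZ" is a hypothetical unseen department

# Column order learned from the training data.
train_columns = list(pd.get_dummies(train_dept).columns)

# Encode the test data, then force it onto the training columns.
test_encoded = (
    pd.get_dummies(test_dept)
    .reindex(columns=train_columns, fill_value=0)
    .astype(int)
)
print(test_encoded)
```

This avoids one `apply` call per department and keeps the column order identical to training by construction.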

In [0]:
test_df.head(5)

Unnamed: 0,student_id,level,course,grade,major,Courese_DPT,006,ABRD,ACB,ACCT,...,THTR,TR,TRNS,UHSG,UICB,ULIB,URES,URP,WLLC,WRIT
0,ppVGBRKhtqqyxnVO,Freshman,SPAN:100,A,Business,SPAN,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,PiPkSgMGbFIu5RwR,Freshman,CSI:160,S,International Relations,CSI,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,PiPkSgMGbFIu5RwR,Sophomore,EES:107,C,International Relations,EES,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,PiPkSgMGbFIu5RwR,Senior,SPAN:201,B,International Relations,SPAN,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,PiPkSgMGbFIu5RwR,Junior,ENTR:200,B+,International Relations,ENTR,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [0]:
x = test_df.copy()

for column in test_df.columns[1:6]:  #Delete all features not used
    x.drop(column,axis=1,inplace = True)
x = x.groupby(by ="student_id",as_index =False,sort = False).sum()
ids = x["student_id"]    
x.drop("student_id",axis =1,inplace = True)

In [0]:
x.head(1)

Unnamed: 0,006,ABRD,ACB,ACCT,ACTS,AERO,AFAM,AINS,AMST,ANIM,...,THTR,TR,TRNS,UHSG,UICB,ULIB,URES,URP,WLLC,WRIT
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [0]:
y_pred = clf.predict(x) 
df = {"student_id":ids,"major":le.inverse_transform(y_pred)}
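
The final mapping from student ID to major relies on `LabelEncoder.inverse_transform` to turn predicted integers back into major names. A small sketch follows; the IDs and the predicted labels are made up, and wrapping the result in a `DataFrame` (rather than a plain dict) is an assumption that makes it easy to write out with `to_csv`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Fit the encoder on example majors; classes_ are stored in sorted order.
le = LabelEncoder()
y_encoded = le.fit_transform(["Business", "Psychology", "English", "Business"])

ids = pd.Series(["s1", "s2", "s3"])  # made-up student IDs
y_pred = [2, 0, 1]                   # hypothetical model output

# Map integer predictions back to major names, one row per student.
result = pd.DataFrame({
    "student_id": ids,
    "major": le.inverse_transform(y_pred),
})
print(result)
```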

