# Do you want to know what makes a student good at Math?
# Do you like looking at beautiful graphs and visualizations?
***If so then this Kernel is perfect for you!***

* We aim to identify, using data visualization, the key features that affect students' performance in the third math exam. For descriptions of the attributes, see https://www.kaggle.com/janiobachmann/math-students
* The data was collected during the 2005-2006 academic year from two schools in the Alentejo region of Portugal.
* Attributes such as the math marks in year 1 and year 2 were collected from the school records.
* Social/economic attributes were collected by having the students fill in a questionnaire.

* Note that the data was collected by the authors of the paper 'Using Data Mining to Predict Secondary School Student Performance'. They are formally cited below.
* They have also built a fairly accurate prediction model, so we do not address prediction here.

Import the required libraries

In [None]:


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns



import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



Get data

In [None]:
url='/kaggle/input/math-students/student-mat.csv'
df=pd.read_csv(url)


Check out the first few rows of the data

In [None]:
df.head()

Elementary statistics associated with the data

In [None]:
df.describe()

Presence of null objects and datatypes

In [None]:
df.info()

# **Feature Engineering**

Categorical Output- 
Pass Vs Fail 
* A student fails if he/she gets below 10

In [None]:
def pass_classify(row):
    if row.G3>=10:
        return 1
    else:
        return 0
    
pass_=df.apply(pass_classify,axis='columns')
#print(pass_fail)
print(pass_.value_counts())
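
The same pass/fail rule can also be written without a row-wise `apply`. The sketch below is only an illustration on a hypothetical toy frame (`toy` and its marks are made up, not taken from the dataset); it assumes the same threshold of 10 used above.

```python
import pandas as pd

# Toy marks (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({"G3": [0, 9, 10, 15, 20]})

# The comparison yields booleans; astype(int) turns them into 1 (pass) / 0 (fail)
toy_pass = (toy["G3"] >= 10).astype(int)
print(toy_pass.tolist())
```

On the real data the equivalent expression would be `(df['G3'] >= 10).astype(int)`.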

Categorical Output-
Grades: A,B,C,D,F
* The grading system is as given in the code below

In [None]:
def grade_classify(row):
    if row.G3>=16:
        return 'A'
    elif row.G3>=14:
        return 'B'
    elif row.G3>=12:
        return 'C'
    elif row.G3>=10:
        return 'D'
    else:
        return 'F'
    

grades=df.apply(grade_classify,axis='columns')
print(grades.value_counts())
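
As an alternative sketch, the same grade boundaries can be expressed as bins with `pd.cut`. The frame `toy` and its marks are hypothetical; the bin edges are copied from the thresholds in the function above, and `right=False` makes each bin include its lower edge (so 10 is a D and 16 is an A).

```python
import numpy as np
import pandas as pd

# Toy marks (hypothetical values, chosen to sit on and around each boundary)
toy = pd.DataFrame({"G3": [0, 9, 10, 11, 12, 13, 14, 15, 16, 20]})

# Left-closed bins: [-inf,10) F, [10,12) D, [12,14) C, [14,16) B, [16,inf) A
bins = [-np.inf, 10, 12, 14, 16, np.inf]
labels = ["F", "D", "C", "B", "A"]
toy_grades = pd.cut(toy["G3"], bins=bins, labels=labels, right=False)

print(toy_grades.astype(str).tolist())
```

Keeping the boundaries in one list makes the grading scheme easy to read and change in a single place.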


Helper function to find percentage of people passed and grade percentage

In [None]:
def get_percent(col):
    return (col.value_counts()/col.value_counts().sum())*100
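
For reference, pandas can produce the same percentages directly through the `normalize` argument of `value_counts`. The sketch below compares both routes on a hypothetical toy series (`toy` is made up for illustration).

```python
import pandas as pd

toy = pd.Series(["A", "F", "F", "D"])  # hypothetical grades

# Same computation as the helper, written out
manual = toy.value_counts() / toy.value_counts().sum() * 100
# Built-in normalisation: proportions that sum to 1, scaled to percent
builtin = toy.value_counts(normalize=True) * 100

print(builtin)
```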


In [None]:
pass_percent=get_percent(pass_)
print(pass_percent)

In [None]:
grade_percent=get_percent(grades)
print(grade_percent)

In [None]:
# sns.distplot is deprecated in recent seaborn versions; histplot draws the same histogram
sns.histplot(data=df, x='G3')

* About one third of the students failed the exam, which is quite surprising.
* More than half of the students received a D or an F grade.

In [None]:
df['grades']=grades
print(grades)

In [None]:
df.info()

# **Cramer's V correlation matrix**

We use the Cramer's V correlation matrix to identify associations between categorical features.

1. We first remove all the continuous attributes from our dataframe

In [None]:
df_cat = df[[i for i in df.columns if i not in ('G1','G2','G3','absences')]]
df_cat.head()

2. Label Encoding of the categorical features is then done

In [None]:
from sklearn import preprocessing

label = preprocessing.LabelEncoder()
data_encoded = pd.DataFrame() 

for i in df_cat.columns :
  data_encoded[i]=label.fit_transform(df_cat[i])

In [None]:
data_encoded.head()

3. Building of the Cramer's V function

In [None]:
from scipy.stats import chi2_contingency
import numpy as np




def cramers_V(var1,var2) :
  crosstab =np.array(pd.crosstab(var1,var2, rownames=None, colnames=None)) # Cross table building
  stat = chi2_contingency(crosstab)[0] # Keeping of the test statistic of the Chi2 test
  obs = np.sum(crosstab) # Number of observations
  mini = min(crosstab.shape)-1 # Smaller of (rows, columns) of the cross table, minus 1
  return np.sqrt(stat/(obs*mini)) # Cramer's V is the square root of chi2/(n*mini)
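
To get a feel for the scale of the statistic, here is a minimal sketch on toy data. It follows the textbook definition, V = sqrt(chi2 / (n * (min(rows, cols) - 1))), and computes the chi-square statistic by hand with NumPy (no continuity correction) so that every step is visible. The function name `cramers_v_sketch` and all the toy series are hypothetical and not part of the dataset.

```python
import numpy as np
import pandas as pd

def cramers_v_sketch(x, y):
    """Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    observed = pd.crosstab(x, y).to_numpy().astype(float)
    n = observed.sum()
    # Expected counts under independence: outer product of the margins / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy data (hypothetical): 27 observations, 3 categories per variable
x = pd.Series(["a", "b", "c"] * 9)
y_same = x.map({"a": "p", "b": "q", "c": "r"})               # fully determined by x
y_indep = pd.Series((["p"] * 3 + ["q"] * 3 + ["r"] * 3) * 3)  # every (x, y) pair equally often

v_same = cramers_v_sketch(x, y_same)
v_indep = cramers_v_sketch(x, y_indep)
print(v_same, v_indep)
```

A perfectly associated pair gives a value of 1 and a perfectly balanced, independent pair gives 0, which is the range the heatmap's `vmin=0, vmax=1` assumes.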

4. Building of the matrix

In [None]:
rows= []

for var1 in data_encoded:
  col = []
  for var2 in data_encoded :
    cramers =cramers_V(data_encoded[var1], data_encoded[var2]) # Cramer's V test
    col.append(round(cramers,2)) # Keeping of the rounded value of the Cramer's V  
  rows.append(col)
  
cramers_results = np.array(rows)
df_var = pd.DataFrame(cramers_results, columns = data_encoded.columns, index =data_encoded.columns)



df_var

5. Add a heatmap to the matrix

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(20,20))
plt.title("Heatmap of categorical variables")
sns.heatmap(data=df_var,vmin=0, vmax=1,annot=True)

plt.show()


* We see that the relationship between the categorical attributes and grades is almost negligible

# **Scatter Plots**

In [None]:
x1=df['G1']
x2=df['G2']
y=df['G3']
plt.figure(figsize=(10,6))
plt.title("Year 3 grade VS Year1 and Year2 grade")
g1=plt.scatter(x1,y,marker='x')
g2=plt.scatter(x2,y,marker='o')
plt.legend((g1, g2),
           ('Year1 Grade', 'Year2 Grade'),
           scatterpoints=1,
           loc='upper left',
           ncol=3,
           fontsize=8)
plt.xlabel("Marks(out of 20) of year1/year2")
plt.ylabel("Marks(out of 20) of year 3")
plt.show() 



* The previous grades G1 and G2 seem to be positively linearly correlated to G3
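
One way to put a number on such a visual impression is the Pearson correlation coefficient. The sketch below uses synthetic marks generated with a fixed seed (they are made up and are not the dataset's values), so it only illustrates the call, not the actual result.

```python
import numpy as np
import pandas as pd

# Synthetic marks (hypothetical): a final mark that stays close to the earlier one
rng = np.random.default_rng(0)
g1 = rng.integers(5, 20, size=200)
g3 = np.clip(g1 + rng.integers(-2, 3, size=200), 0, 20)
toy = pd.DataFrame({"G1": g1, "G3": g3})

# Pearson correlation between the two columns (1 means a perfect positive linear relation)
r = toy["G1"].corr(toy["G3"])
print(round(r, 2))
```

On the real data, `df[['G1','G2','G3']].corr()` would give the full correlation matrix for the three grades.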

# **No of absent days VS Marks scatter plot**

In [None]:
plt.figure(figsize=(10,6))
plt.title("No of absent days VS Marks")
plt.scatter(df["absences"],df["G3"])

plt.ylabel("Marks(out of 20) of year 3")
plt.xlabel("Absent days")
plt.show()

* There doesn't seem to be much of a relationship between the number of absent days and marks. However, since most students took only 0 to 10 days of leave, we may not have enough information to conclude much.

# **Boxplots**

In [None]:
#plt.figure(figsize=(12,7))
plt.title("Sex vs Math marks in final exam")
ax = sns.boxplot(x="sex", y="G3", data=df)
plt.ylabel("Marks in final exam")
plt.show()


plt.title("Daily drinking VS Math marks")
ax = sns.boxplot(x="Dalc", y="G3",data=df)
plt.xlabel("Daily alcohol consumption")
plt.ylabel("Marks in final exam")
plt.show()

plt.title("Quality of family relationship VS Math marks")
ax=sns.boxplot(x="famrel",y="G3",data=df)
plt.ylabel("Marks in final exam")
plt.xlabel("Quality of family relationship")
plt.show()


ax = sns.boxplot(x="romantic", y="G3", hue="goout",
                 data=df)
plt.ylabel("Marks in final exam")
plt.xlabel("Involved in a romantic relationship")
plt.show()


ax=sns.boxplot(x="reason",y="G3",data=df)
plt.ylabel("Marks in final exam")
plt.xlabel("Reason to choose this school")
plt.show()





* The boxplots don't give much information as the categorical factors don't seem to be correlated with the final exam marks

# 2D KDE plots

We have a look at a 2D KDE plot between marks in the final exam vs the first/second exam respectively

In [None]:

sns.jointplot(x=df['G1'], y=df['G3'], kind="kde")
#plt.title("2D KDE plot b/w marks in first exam vs marks in final exam")
plt.xlabel('Marks in first exam')
plt.ylabel('Marks in final exam')
plt.show()



sns.jointplot(x=df['G2'], y=df['G3'], kind="kde")
#plt.title("2D KDE plot b/w marks in second exam vs marks in final exam")
plt.xlabel('Marks in second exam')
plt.ylabel('Marks in final exam')
plt.show()


* This shows in a very beautiful manner the linear relationship between the marks in the final exam and the marks in the first/second exams respectively
* Also the data is centered around (10,10) in both plots

**Converting Categorical data to Numerical data**

We want to use label encoding for ONLY those labels in which the labels can be compared to one another. For example if the labels were short and tall, we would assign 0 to short and 1 to tall

We label encode the 'romantic' attribute('romantic'='yes' if student is in a romantic relationship, else it is 'no'). We want to assign 0 to 'no' and 1 to 'yes'

In [None]:
from sklearn import preprocessing 
  
# label_encoder object knows how to understand word labels. 
label_encoder = preprocessing.LabelEncoder() 
  
# Encode labels in columns

romantic_no=label_encoder.fit_transform(df['romantic'])

print(romantic_no[0])
print(df['romantic'][0])


The assignment turned out to be the one we wanted: LabelEncoder numbers the labels in sorted order, so 'no' becomes 0 and 'yes' becomes 1. Had 1 been assigned to 'no' instead, we would have had to modify the labelling.
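
If the direction of the encoding matters, an explicit mapping removes any need to check the result afterwards. This is a minimal sketch on a hypothetical toy series; the `mapping` dictionary is an assumption that matches the 0 = 'no', 1 = 'yes' convention described above.

```python
import pandas as pd

# Toy column (hypothetical values)
toy = pd.Series(["no", "yes", "yes", "no"], name="romantic")

# Spell out the intended codes instead of relying on the encoder's ordering
mapping = {"no": 0, "yes": 1}
encoded = toy.map(mapping)

# Any label missing from the mapping would become NaN, so check for that
assert encoded.notna().all()
print(encoded.tolist())
```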

In [None]:
df['romantic']=romantic_no

Converting more categorical labels to numeric (after checking that the encoding is right)

In [None]:
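famsize_no=label_encoder.fit_transform(df['famsize'])
# Flip the 0/1 codes produced by the encoder
df['famsize']=1-famsize_no

activities_no=label_encoder.fit_transform(df['activities'])
# Overwrite the existing 'activities' column (the misspelt name 'activites' would create a new one)
df['activities']=activities_no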

df['Pstatus']=label_encoder.fit_transform(df['Pstatus'])
df['nursery']=label_encoder.fit_transform(df['nursery'])
df['internet']=label_encoder.fit_transform(df['internet'])
df['higher']=label_encoder.fit_transform(df['higher'])
df['schoolsup']=label_encoder.fit_transform(df['schoolsup'])
df['famsup']=label_encoder.fit_transform(df['famsup'])
df['paid']=label_encoder.fit_transform(df['paid'])


In [None]:
print(df.head())

In [None]:
df.info()

# **Bar Plots**

* We first group the data based on the grades obtained by the student. 
* We wish to see whether we can predict any attribute of the student given their grade.

In [None]:
grouped_df=df.groupby('grades')
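# Select several columns with a list (double brackets); a bare tuple is rejected by recent pandas
print(grouped_df[['freetime','famrel','goout','romantic','Pstatus','activities','paid']].mean())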


# **Relationship Quotient Vs Grade**

The Relationship Quotient of a set of students is defined as the fraction of students in that set who are in a romantic relationship
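
Because 'romantic' is encoded as 0/1, this quotient is simply the mean of that column within each grade group. A small sketch on hypothetical toy data (the frame `toy` is made up for illustration):

```python
import pandas as pd

# Toy data (hypothetical): four A students, two F students
toy = pd.DataFrame({
    "grades":   ["A", "A", "A", "A", "F", "F"],
    "romantic": [0,   0,   0,   1,   1,   0],
})

# The mean of a 0/1 column is the fraction of ones in each group
quotient = toy.groupby("grades")["romantic"].mean()
print(quotient)
```

Here one of the four A students is in a relationship (0.25) against one of the two F students (0.5).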

In [None]:
plt.figure(figsize=(10,6))
plt.title("Relationship Quotient Vs Grade")
sns.barplot(x=["A","B","C","D","F"], y=grouped_df['romantic'].mean())
plt.ylabel("Relationship Quotient")
plt.xlabel("Grade")

* The students that get A grade seem to be less likely to be in a romantic relationship than others

# **Free Time Vs Grade**

In [None]:
plt.figure(figsize=(10,6))
plt.title("Free Time Vs Grade")
sns.barplot(x=["A","B","C","D","F"], y=grouped_df['freetime'].mean())
plt.ylabel("Free Time")
plt.xlabel("Grade")

* There doesn't seem to be any relationship between free time and grade (this matches what we concluded from the heatmap)

# **Go out Vs Grade**

In [None]:
plt.figure(figsize=(10,6))
plt.title("Goout Vs Grade")
sns.barplot(x=["A","B","C","D","F"], y=grouped_df['goout'].mean())
plt.ylabel("GO out")
plt.xlabel("Grade")

* There doesn't seem to be any relationship between going out and grade (this matches what we concluded from the heatmap)

# Conclusions from Exploratory Data Analysis:
1. Most categorical features don't seem to affect students' performance in exams when compared to the continuous attributes. This is quite surprising, because a priori one would assume that social factors, such as the quality of family relationships or whether the student attends paid tuition, would play a big role.
2. The categorical attributes also don't seem to be correlated with one another.
3. The most important factors determining a student's performance in the final math exam seem to be the student's performance in the previous two exams.


# Citation:
1. P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
2. Cramer's V Correlation matrix code -: https://www.kaggle.com/chrisbss1/cramer-s-v-correlation-matrix


# Things to Improve/build on:
* Use data from different places around the world and different age groups (e.g. pre-school, school, university) to check whether the conclusions we derived from this dataset hold universally


