# STUDENTS PERFORMANCE IN EXAMS 

***Task: Predict student performance from demographic and socioeconomic information.***


![](https://storage.googleapis.com/kaggle-datasets-images/74977/169835/f2893f90d8f6c135baf743f4a135761a/dataset-cover.jpg?t=2018-11-10-03-10-57)

# IMPORTING THE NECESSARY LIBRARIES

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import scipy as sp
import re
import time
import matplotlib.pyplot as plt
import seaborn as sns
import os
import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault)
# Input data files are available in the read-only "../input/" directory



# READING THE DATA 

In [None]:
data=pd.read_csv("../input/students-performance-in-exams/StudentsPerformance.csv")

In [None]:
data.head()

In [None]:
data

***We have 1000 rows and 8 columns, of which 3 are of int type and the rest are of object type.***

In [None]:
data.info()

**So our data frame contains only non-null values.**


In [None]:
data.isnull().sum()

# QUICK VISUALIZATIONS

In [None]:
sns.set_style('darkgrid')
sns.countplot(y='gender',data=data,palette='colorblind')
plt.xlabel('Count')
plt.ylabel('Gender')
plt.show()

In [None]:
# Count the female and male students directly from the column
female_count = (data['gender']=='female').sum()
male_count = (data['gender']=='male').sum()   # avoids hard-coding the total of 1000 rows
print("female count is:",female_count,"\n","male count is:",male_count)

***Out of 1000 students, 518 are female and 482 are male, so the data frame is nearly balanced by gender, with slightly more females.***
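The same counts, and the share of each category, can be read off in one step with `Series.value_counts`. This is a minimal sketch on a small made-up frame (the six rows below are illustrative and not taken from the dataset); on the real data you would call it on `data['gender']`.

```python
import pandas as pd

# Toy stand-in for the `gender` column; these rows are made up for illustration.
toy = pd.DataFrame({"gender": ["female", "male", "female", "female", "male", "female"]})

# Absolute counts per category, sorted from most to least frequent.
gender_counts = toy["gender"].value_counts()

# Shares per category (fractions that sum to 1).
gender_share = toy["gender"].value_counts(normalize=True)

print(gender_counts)
print(gender_share.round(3))
```

Using `normalize=True` makes it easy to compare category balance across columns with different numbers of categories.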

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='race/ethnicity',data=data,palette='colorblind')
plt.xlabel("Race/Ethnicity")
plt.ylabel("Count")
plt.show()

**So our data frame contains 5 race/ethnicity groups, of which group C is the largest and group A is the smallest.**

In [None]:
sns.set_style('whitegrid')
sns.countplot(y='parental level of education',data=data,palette='colorblind')
plt.xlabel("Count")
plt.ylabel("Parental Level of Education")
plt.show()

***So we find that the most common parental education levels are some college and an associate's degree; few parents have a bachelor's or a master's degree.***

In [None]:
sns.set_style('whitegrid')
sns.countplot(y='lunch',data=data,palette='colorblind')
plt.xlabel("Count")
plt.ylabel("Lunch")
plt.show()

In [None]:
sns.set_style('whitegrid')
sns.countplot(y='test preparation course',data=data,palette='colorblind')
plt.ylabel("Test Preparation Course")
plt.xlabel("Count")
plt.show()

***So we find that most students did not take the test preparation course.***

***Now, let's find out whether there is a relationship between students' scores in different subjects.***

In [None]:
sns.set_style('darkgrid')
plt.title('Maths score vs Reading score',size=16)
plt.xlabel('Maths Score',size=12)
plt.ylabel('Reading Score',size=12)
sns.scatterplot(x='math score',y='reading score',data =data,hue='gender',edgecolor='black',palette='cubehelix',hue_order=['male','female'])
plt.show()

In [None]:
sns.set_style('whitegrid')
plt.title('Maths score vs Writing score',size=16)
plt.xlabel('Maths score',size=12)
plt.ylabel('Writing score',size=12)
sns.scatterplot(x='math score',y='writing score',data =data,hue='gender',s=90,edgecolor='black',palette='cubehelix',hue_order=['male','female'])
plt.show()

In [None]:
sns.set_style('whitegrid')
plt.title('Reading score vs Writing score',size=16)
plt.xlabel('Reading score',size=12)
plt.ylabel('Writing score',size=12)
sns.scatterplot(x='reading score',y='writing score',data =data,hue='gender',s=90,edgecolor='black',palette='colorblind',hue_order=['male','female'])
plt.show()

***Maths scores plotted against reading and writing scores are somewhat spread out but follow a clear upward trend, so a student who scores higher in maths generally also scores higher in the other subjects. Reading and writing scores follow a tighter, more linear relationship.***
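The visual impression can be backed by a number: the Pearson correlation coefficient between each pair of scores. This is a minimal sketch using `DataFrame.corr` on made-up scores (the values below are illustrative only); on the real data you would pass `data[['math score','reading score','writing score']]`.

```python
import pandas as pd

# Made-up scores for five students, for illustration only.
scores = pd.DataFrame({
    "math score":    [60, 70, 80, 50, 90],
    "reading score": [65, 72, 85, 55, 88],
    "writing score": [63, 74, 86, 52, 90],
})

# Pairwise Pearson correlation; values close to 1 indicate a strong
# positive linear relationship between two columns.
corr_matrix = scores.corr(method="pearson")
print(corr_matrix.round(2))
```

A higher coefficient for reading vs writing than for maths vs either of them would match what the scatter plots suggest.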

In [None]:
# total_marks is the combined score of all three subjects, rescaled to a 0-100 range
total_marks = ((data['math score'] + data['reading score'] + data['writing score'])/300)*100
data['total_marks'] = total_marks
kde_data = data[['math score','reading score','writing score','total_marks']]

**Now, let's find out how other features affect total marks.**

In [None]:
sns.set_style("darkgrid")
sns.kdeplot(data=kde_data,fill=True,palette='colorblind')   # `fill` replaces the deprecated `shade` argument
plt.show()

In [None]:
sns.catplot(x='race/ethnicity',y='total_marks',data =data,hue='test preparation course',palette='colorblind',kind='box',showfliers=False)
plt.xlabel("Race/Ethnicity")
plt.ylabel("Total Marks")
plt.show()

In [None]:
plt.rcParams["figure.figsize"] = (8, 4)  # alternatively, use plt.figure(figsize=(8,4))
plt.rcParams["xtick.labelsize"] = 5
order = ["master's degree","bachelor's degree","associate's degree","some college","high school","some high school"]
sns.catplot(x='parental level of education',y='total_marks',data =data,hue='test preparation course',order=order,palette='Dark2_r',kind='box',showfliers=False)
plt.xlabel("Parental Level of Education")
plt.ylabel("Total Marks")
plt.show()

In [None]:
sns.catplot(x='parental level of education',y='total_marks',hue='lunch',data=data,order=order,palette='cubehelix')
plt.xlabel("Parental Level of Education")
plt.ylabel("Total Marks")
plt.show()

In [None]:
mpl.rcParams.update(mpl.rcParamsDefault)

**Some conclusions we can draw:**

* Students with standard lunch tend to score higher on average.
* Students in group B scored lowest, while students in group E scored highest.
* Students' test preparation is affected by their parents' education.
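Conclusions like these can be checked numerically by averaging `total_marks` within each category. This is a minimal sketch with `groupby`, using a made-up frame (the rows and the category labels are assumptions for illustration and say nothing about the real data); on the notebook's `data` the same two calls apply unchanged.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard",
              "free/reduced", "standard", "standard"],
    "race/ethnicity": ["group A", "group B", "group C",
                       "group A", "group E", "group B"],
    "total_marks": [70.0, 55.0, 80.0, 60.0, 90.0, 65.0],
})

# Mean total marks per lunch type.
mean_by_lunch = toy.groupby("lunch")["total_marks"].mean()

# Mean total marks per group, sorted from lowest to highest.
mean_by_group = toy.groupby("race/ethnicity")["total_marks"].mean().sort_values()

print(mean_by_lunch)
print(mean_by_group)
```

Sorting the grouped means makes the lowest- and highest-scoring categories easy to spot without reading them off a plot.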

# MODEL 

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge      # Ridge is linear regression with L2 regularization
from sklearn.metrics import mean_squared_error

In [None]:
data

In [None]:
data.value_counts()

In [None]:
data_model = data.drop(['math score','reading score','writing score'],axis=1)

In [None]:
y = data_model['total_marks']
data_model = data_model.drop('total_marks',axis=1)

In [None]:
data_model = pd.get_dummies(data_model)
data_model
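`pd.get_dummies` replaces each categorical column with one indicator column per category (0/1 or True/False, depending on the pandas version). This is a minimal sketch on a made-up frame; `drop_first=True` is an optional variation, not used in this notebook, that drops one indicator per feature so the remaining columns are not perfectly collinear.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
})

# One indicator column per (feature, category) pair.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# Variation: drop the first category of each feature.
encoded_reduced = pd.get_dummies(toy, drop_first=True)
print(list(encoded_reduced.columns))
```

With a regularized model such as Ridge, keeping all indicator columns usually works; dropping one per feature matters more for unregularized linear regression.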

In [None]:
x_train,x_test,y_train,y_test = train_test_split(data_model,y,test_size=0.2,random_state=42)  # returned in the order X_train, X_test, y_train, y_test

In [None]:
model = Ridge()           # Ridge is linear regression with L2 regularization
model.fit(x_train,y_train)
pred = model.predict(x_test)
train_pred = model.predict(x_train)
score = np.sqrt(mean_squared_error(y_test,pred))   # RMSE on the held-out test set
score

# FEATURE IMPORTANCE PLOT

In [None]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(x_train,y_train)
feature_importance = np.array(model.feature_importances_)
feature_names = np.array(x_train.columns)
importance_data={'feature_names':feature_names,'feature_importance':feature_importance}  # separate name so the `data` DataFrame is not overwritten
df_plt = pd.DataFrame(importance_data)
df_plt.sort_values(by=['feature_importance'], ascending=False,inplace=True)
plt.style.use("ggplot")    # set the style before creating the figure so it takes effect
plt.figure(figsize=(8,6))
sns.barplot(x=df_plt['feature_importance'], y=df_plt['feature_names'])
plt.xlabel('FEATURE IMPORTANCE')
plt.ylabel('FEATURE NAMES')
plt.show()
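The importances plotted above are the random forest's impurity-based values, which are computed from the training data. As a complementary check, scikit-learn's `permutation_importance` measures how much the model's score drops when a single column is shuffled. This is a minimal sketch on synthetic data (the arrays and the column names `f0`, `f1`, `f2` are hypothetical); in this notebook you would pass the fitted forest together with the held-out features and targets. For brevity the sketch evaluates on the same data it was fit on.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic binary features (hypothetical names), similar in shape to one-hot columns.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 2, size=(200, 3)), columns=["f0", "f1", "f2"])

# The target depends strongly on f0, weakly on f1, and not at all on f2.
y = 10.0 * X["f0"] + 1.0 * X["f1"] + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each column several times and record the mean drop in the R^2 score.
result = permutation_importance(forest, X, y, n_repeats=5, random_state=0)
perm_importance = pd.Series(result.importances_mean, index=X.columns)
perm_importance = perm_importance.sort_values(ascending=False)
print(perm_importance)
```

If both methods rank the same features at the top, that ranking is less likely to be an artifact of how the forest splits on indicator columns.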

# THANK YOU

**IF YOU LIKE THE KERNEL PLEASE UPVOTE !!!!**

# **KEEP KAGGLING !**