<h1><center>EDA and average performance prediction</center></h1>
  <h2><center>Exploratory Data Analysis and prediction of students' average performance based on socioeconomic data</center></h2>
  <img src="https://cdn.pixabay.com/photo/2019/04/14/10/27/book-4126483_960_720.jpg" />

<h1>Introduction:</h1>
<p><justify>Several factors are thought to influence, positively or negatively, how students perform on tests of core academic skills. The Students Performance in Exams dataset contains socioeconomic information about each student. This notebook explores that information to evaluate how well factors such as completing a test preparation course or the type of lunch a student receives can predict performance on the tests.</justify></p>

<h2>Libraries</h2>

In [None]:
# Data manipulation
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Plot and configure graphics
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Transformation of categorical variables
from sklearn.preprocessing import LabelEncoder

# Prediction and model evaluation
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error 

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<h1>Exploratory Data Analysis:</h1>

In [None]:
# Load the dataset
file = ('../input/students-performance-in-exams/StudentsPerformance.csv')
org_data = pd.read_csv(file)

<h2>What is the average math, reading and writing score?</h2>

In [None]:
# Shows the main statistical data from the dataset
df = org_data.copy()
df['mean score']  = (df['math score'] + df['reading score'] + df['writing score'])/3 #Add a new column 
df.describe()

<h2>How are students divided by gender?</h2>

In [None]:
df['gender'].value_counts()

<h2>How are the race/ethnicity groups distributed in the data?</h2>

In [None]:
fig, ax = plt.subplots(figsize =(8,6)) 
sns.set_style('darkgrid')

sns.countplot(data=df, x ='gender',hue ='race/ethnicity', palette = 'cubehelix')
ax.set(title='Distribution by gender of students and race', ylabel='')

fig.tight_layout()
plt.show()

<h2>Parental level of education by race/ethnicity group</h2>

In [None]:
fig, ax = plt.subplots(figsize =(8,6)) 
sns.set_style('darkgrid')

sns.countplot(data=df, y ='parental level of education', hue= 'race/ethnicity', palette = 'cubehelix')
ax.set(title='parental level of education', ylabel='', xlabel='')

fig.tight_layout()
plt.show()

<h2>How many of the students take preparatory classes?</h2>

In [None]:
cont_test = df['test preparation course'].value_counts()
labels = df['test preparation course'].value_counts().index

fig, ax = plt.subplots(figsize =(6,6))
cor = ['palegreen','limegreen']
plt.pie(cont_test, labels =labels, autopct='%1.1f%%', colors= cor)
plt.legend(labels, loc=1)
plt.title('test preparation course')

fig.tight_layout()
plt.show()

<h2>Lunch</h2>
<p>Nutrition matters for cognitive skills. Using the mean score across the three tests, we can assess how the type of lunch relates to a student's score. The plot also compares students who took the test preparation course with those who did not.</p>

In [None]:
fig, ax = plt.subplots(figsize =(8,6))
sns.set_style('darkgrid')

sns.stripplot(data= df, x='lunch', y='mean score', hue= 'test preparation course', palette='cubehelix')
ax.set(title='Performance')

fig.tight_layout()
plt.show()
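
<p>The strip plot gives a visual impression; a small table of group averages can put numbers on the same comparison. The sketch below uses a hypothetical toy DataFrame (<code>toy</code>) that only mimics the dataset's column names, so the values are illustrative and are not results from the real data.</p>

```python
import pandas as pd

# Hypothetical toy rows that only mimic the dataset's column names
toy = pd.DataFrame({
    'lunch': ['standard', 'standard', 'free/reduced',
              'free/reduced', 'standard', 'free/reduced'],
    'test preparation course': ['completed', 'none', 'completed',
                                'none', 'none', 'none'],
    'mean score': [80.0, 70.0, 68.0, 55.0, 74.0, 61.0],
})

# Average mean score for every lunch / preparation course combination
summary = toy.pivot_table(index='lunch',
                          columns='test preparation course',
                          values='mean score',
                          aggfunc='mean')
print(summary.round(1))
```

<p>Replacing <code>toy</code> with <code>df</code> should produce the same kind of table for the real dataset.</p>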

<h2>Is there a difference in performance by the student's gender?</h2>

In [None]:
df_nc = df.select_dtypes(['int64'])

# One scatter plot per score column, colored by gender
for nc in df_nc:
    fig, ax = plt.subplots(figsize=(8,6))
    sns.scatterplot(data=df, x=nc, y='mean score', hue='gender', palette='cubehelix', ax=ax)
    fig.tight_layout()
plt.show()

<h2>Which skill does lunch affect most?</h2>

In [None]:
df_nc = df.select_dtypes(['int64'])

# One scatter plot per score column, colored by lunch type
for nc in df_nc:
    fig, ax = plt.subplots(figsize=(8,6))
    sns.scatterplot(data=df, x=nc, y='mean score', hue='lunch', ax=ax)
    fig.tight_layout()
plt.show()

<h1>Feature engineering:</h1>

<p>The heat map shows that performance is correlated across skills: a student who is good at reading will probably also be good at writing. Interestingly, math is somewhat less strongly correlated with reading and writing than those two are with each other. The heat map also shows a small correlation (.35) for the 'lunch' feature, which is curious and deserves further exploration.</p>

In [None]:
# Transforming categorical features and reviewing correlations 

df_le = df.copy()
col  = ['gender', 'race/ethnicity', 'parental level of education', 'lunch','test preparation course']
for c in col:
    df_le[c] = LabelEncoder().fit_transform(df_le[c])

corelation_2 = df_le.corr()

# correlations

fig, ax = plt.subplots(figsize =(10,8)) 
sns.set_style('darkgrid')
plt.title('correlations between features')
sns.heatmap(corelation_2, annot=True)
fig.tight_layout()
plt.show()
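
<p>When reading these correlations, it helps to remember that <code>LabelEncoder</code> assigns integer codes in the sorted order of the class labels. For a two-level feature such as 'lunch' this only fixes the sign of the correlation, but for a nominal feature with several levels, such as 'race/ethnicity', the coefficient depends on that arbitrary ordering. The sketch below, on a hypothetical toy DataFrame (<code>toy</code>), shows the mapping and one-hot columns as a possible alternative.</p>

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows that only mimic two of the dataset's columns
toy = pd.DataFrame({
    'lunch': ['standard', 'free/reduced', 'standard'],
    'race/ethnicity': ['group C', 'group A', 'group B'],
})

# LabelEncoder sorts the class labels and numbers them from 0
le = LabelEncoder()
codes = le.fit_transform(toy['lunch'])
mapping = dict(zip(le.classes_, range(len(le.classes_))))
print(mapping)

# One-hot columns avoid imposing an order on a nominal feature
dummies = pd.get_dummies(toy['race/ethnicity'])
print(dummies.columns.tolist())
```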

<h1>Prediction model:</h1>

<h2>The model</h2>

<p>The model chosen to predict each student's mean performance across the three skills is the Random Forest Regressor from the scikit-learn library. The dataset was split into a train set (80%) and a test set (20%). The model was evaluated using the mean absolute error (MAE) on the test set and the model's score (the coefficient of determination, R²) on both the train and test sets.</p>

In [None]:
pontos = ['math score','reading score','writing score','mean score']
X = pd.get_dummies(df).drop(pontos, axis = 1)
y = df['mean score']

X_train, X_test, y_train, y_test = train_test_split(X,y, train_size= 0.8, random_state = 1)

# model

modelo = RandomForestRegressor(random_state= 50, max_depth=5 , bootstrap=True)
modelo.fit(X_train, y_train)

y_predito = modelo.predict(X_test)

mae = mean_absolute_error(y_test, y_predito)  # scikit-learn convention: (y_true, y_pred)
score_train = modelo.score(X_train, y_train)
score_test = modelo.score(X_test, y_test)

print('mean absolute error: {:.3f}\nscore_train: {:.2f}\nscore_test: {:.2f}'.format(mae, score_train, score_test))
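
<p>An MAE value is easier to judge next to a simple baseline. The sketch below computes the MAE by hand, checks it against <code>mean_absolute_error</code>, and compares it with a baseline that always predicts the mean target (<code>DummyRegressor</code> from scikit-learn). The arrays <code>y_true</code> and <code>y_pred</code> are hypothetical values chosen for illustration, not results from this notebook.</p>

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical targets and predictions, not results from the notebook
y_true = np.array([60.0, 70.0, 80.0, 90.0])
y_pred = np.array([65.0, 68.0, 77.0, 80.0])

# MAE is the mean of the absolute differences
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

# Baseline: ignore the features and always predict the mean target
X_dummy = np.zeros((len(y_true), 1))
baseline = DummyRegressor(strategy='mean').fit(X_dummy, y_true)
mae_baseline = mean_absolute_error(y_true, baseline.predict(X_dummy))

print(mae_manual, mae_sklearn, mae_baseline)
```

<p>In the notebook, the same comparison could be made by fitting the baseline on <code>X_train, y_train</code> and scoring it on <code>X_test, y_test</code>.</p>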

<h2>Which features are most important to the model?</h2>

In [None]:
# Important Features 
feature_t = np.array(modelo.feature_importances_)
feature_name = np.array(X.columns)

data = {'feature_name': feature_name, 'feature_t': feature_t}
data_frame = pd.DataFrame(data)
data_frame.sort_values(by=['feature_t'], ascending=False,inplace=True)

fig, ax = plt.subplots(figsize =(9,7)) 
sns.set_style('darkgrid')
sns.barplot(data= data_frame, x= 'feature_t', y= 'feature_name', palette='cubehelix')

ax.set(title='Important Features ', ylabel='FEATURES\n\n', xlabel='IMPORTANCES')

fig.tight_layout()
plt.show()
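
<p>Because <code>pd.get_dummies</code> splits each categorical feature into one column per category, the importance of a feature is spread over several bars. One possible way to read it per original feature is to sum the importances of its dummy columns. The sketch below assumes the default <code>feature_category</code> column naming and uses hypothetical importance values (<code>importances</code>), not the ones produced by the model above.</p>

```python
import pandas as pd

# Hypothetical importances for one-hot columns, named the way
# pd.get_dummies names them by default: "<feature>_<category>"
importances = pd.Series({
    'lunch_free/reduced': 0.20,
    'lunch_standard': 0.25,
    'gender_female': 0.10,
    'gender_male': 0.05,
    'test preparation course_completed': 0.15,
    'test preparation course_none': 0.25,
})

# Strip the "_<category>" suffix to recover the original feature name
original = [name.rsplit('_', 1)[0] for name in importances.index]

# Sum the importances of all dummy columns that share a feature
by_feature = importances.groupby(original).sum().sort_values(ascending=False)
print(by_feature)
```

<p>This relies on category values containing no underscore, which holds for the column values shown in this notebook.</p>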

<h1>Conclusion:</h1>
<p>The dataset analyzed suggests that performance is more closely related to the test preparation course than to the student's gender, since gender shows no significant difference in performance between male and female students. The data also point to the influence of lunch type, which may reflect the family's economic situation: students with a standard lunch score better on all skills, even among those who took the test preparation course. The predictive model was judged satisfactory, with an MAE of 10.064.</p>

<small>Note: This notebook may contain beginner errors; corrections to the code are welcome. Thanks!</small>