## Student Performance in Exams

Data analysis of student performance in a set of exams.

### Objective

The main goal is to understand how certain features influence student performance.

### Imports

In [None]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Loading data

Source: https://www.kaggle.com/spscientist/students-performance-in-exams

In [None]:
df = pd.read_csv("../input/StudentsPerformance.csv")

First view.

In [None]:
df.head()

### Evaluating the data

Information about this data.

In [None]:
df.info()

Renaming some columns to make the analysis easier and avoid mistakes. Some of the original names contained a slash (/) or spaces.
I tried not to change the meaning of the column names.

In [None]:
df = df.rename(columns={
    'race/ethnicity': 'ethnicity',
    'parental level of education': 'parents_education',
    'test preparation course': 'test_preparation_course',
    'math score': 'math_score',
    'reading score': 'reading_score',
    'writing score': 'writing_score',
})
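
The explicit mapping above picks each new name by hand. As an alternative, a minimal sketch of a generic clean-up that replaces spaces and slashes in every column name is shown here. The `toy` frame and its values are made up for illustration, and the resulting names differ from the ones used in this notebook (for example `race_ethnicity` instead of `ethnicity`).

```python
import pandas as pd

# Toy frame with three of the original Kaggle column names (values are made up).
toy = pd.DataFrame({
    "race/ethnicity": ["group A"],
    "parental level of education": ["high school"],
    "math score": [70],
})

# Replace every space or slash in the column names with an underscore.
toy.columns = toy.columns.str.replace(r"[ /]", "_", regex=True)
print(list(toy.columns))
```

This keeps the code short, but the hand-written mapping gives shorter and more readable names.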

In [None]:
df.head()

Checking null values.

In [None]:
df.isnull().sum()

### Data Analysis

First, let's see which categories each feature contains.

In [None]:
df.ethnicity.value_counts()

In [None]:
df.parents_education.value_counts()

In [None]:
df.lunch.value_counts()

In [None]:
df.test_preparation_course.value_counts()

Analyzing the scores.

In [None]:
df.describe()

From this table we can already highlight some important points:

* Some students aced the exams
* The average score is approximately 68, with a standard deviation of about 15
* The majority of scores are above 57
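
Points like these can also be checked directly instead of being read off the `describe()` table. The sketch below uses a made-up `scores` frame (an assumption for illustration, not the Kaggle data) and assumes that "aced" means a score of 100 in all three subjects.

```python
import pandas as pd

# Made-up scores for four students; the real notebook would use df instead.
scores = pd.DataFrame({
    "math_score": [100, 45, 72, 60],
    "reading_score": [100, 58, 80, 66],
    "writing_score": [100, 50, 78, 59],
})

# Number of students with 100 in every subject.
perfect = (scores == 100).all(axis=1).sum()

# Fraction of all individual scores strictly above 57.
share_above_57 = (scores > 57).to_numpy().mean()

print(perfect, round(share_above_57, 3))
```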

The American grading system is used to divide the students into groups according to their performance.
This system assigns a grade to each score interval:

* Score >= 90 - A
* 90 > score >= 80 - B
* 80 > score >= 70 - C
* 70 > score >= 60 - D
* 60 > score - E/F

Therefore, the best students are those whose average score is 90 or above, which earns grade A.
Students whose average score is below 60 are considered to have a poor performance and receive grade E/F. This is the criterion used to identify students whose performance is not acceptable.

Source:
https://nces.ed.gov/nationsreportcard/hsts/howgpa.aspx

Creating a column with the average of the scores.

In [None]:
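# Average only the three score columns; recent pandas versions raise an error on text columns
df['mean_score'] = df[['math_score', 'reading_score', 'writing_score']].mean(axis=1)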
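
A row-wise mean is only meaningful over the three score columns. The minimal sketch below, on a made-up `toy` frame (names and values are assumptions for illustration), selects them explicitly, which also keeps text columns such as `gender` out of the calculation.

```python
import pandas as pd

# Two made-up students; gender is a text column and must not enter the mean.
toy = pd.DataFrame({
    "gender": ["female", "male"],
    "math_score": [72, 60],
    "reading_score": [72, 66],
    "writing_score": [74, 63],
})

score_cols = ["math_score", "reading_score", "writing_score"]

# axis=1 averages across the selected columns, one value per row.
toy["mean_score"] = toy[score_cols].mean(axis=1)
print(toy)
```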

Creating a column with the grades.

In [None]:
# Function that converts an average score into a letter grade
def ScoretoGrade(mscore):
    if mscore >= 90:
        return 'A'
    if mscore >= 80:
        return 'B'
    if mscore >= 70:
        return 'C'
    if mscore >= 60:
        return 'D'
    return 'E/F'

# Create the column with the new grades
df['grade'] = df['mean_score'].apply(ScoretoGrade)

Checking the new features.

In [None]:
df.head()
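
The same score-to-grade binning can also be expressed with `pd.cut` instead of a hand-written function. This is a minimal sketch on a made-up `mean_score` series (an assumption for illustration). `right=False` makes every interval closed on the left, which matches the `>=` thresholds listed earlier.

```python
import pandas as pd

# Made-up average scores, including values exactly on the grade boundaries.
mean_score = pd.Series([95.0, 90.0, 89.9, 80.0, 70.0, 60.0, 59.9, 0.0])

# Bins are [0, 60), [60, 70), [70, 80), [80, 90), [90, inf).
grade = pd.cut(
    mean_score,
    bins=[0, 60, 70, 80, 90, float("inf")],
    labels=["E/F", "D", "C", "B", "A"],
    right=False,
)
print(grade.tolist())
```

A side effect of this approach is that the result is an ordered categorical, so plots and tables can follow the grade order without an explicit `order` argument.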

Creating datasets for the best and the worst students.

In [None]:
# Top students (grade A)
Top_students = df[df.grade == 'A']
# Failing students (grade E/F)
Fail_students = df[df.grade == 'E/F']

Checking the new datasets.

In [None]:
Top_students.head()

In [None]:
Fail_students.head()

Let's see how the students' scores are distributed in each subject using a histogram. Each plot gets a line for the mean score in that subject and a line for the minimum acceptable score, 60, below which a student receives grade E/F.

Creating variables with the average score of each subject.

In [None]:
mean_math = df['math_score'].mean()
mean_reading = df['reading_score'].mean()
mean_writing = df['writing_score'].mean()

Creating a histogram of the math scores.

In [None]:
# Plot the histogram
plt.hist(df['math_score'], rwidth=0.9, edgecolor='k')
# Add the axis labels
plt.xlabel('Score')
plt.ylabel('Frequency')
# Add the title
plt.title('Histogram Math Score')
# Add the mean line
plt.axvline(mean_math, color='k', linestyle='dashed', linewidth=3, label='Mean')
# Add the minimum acceptable score line
plt.axvline(60, color='r', linestyle='dashed', linewidth=3, label='Acceptable Score')
# Add the legend, using the labels set on the two lines
plt.legend()
plt.show()

Creating a histogram of the reading scores.

In [None]:
plt.hist(df['reading_score'], rwidth=0.9, edgecolor='k')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.title('Histogram Reading Score')
plt.axvline(mean_reading, color='k', linestyle='dashed', linewidth=2.5, label='Mean')
plt.axvline(60, color='r', linestyle='dashed', linewidth=3, label='Acceptable Score')
# The legend uses the labels set on the two lines
plt.legend()
plt.show()

Creating a histogram of the writing scores.

In [None]:
plt.hist(df['writing_score'], rwidth=0.9, edgecolor='k')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.title('Histogram Writing Score')
plt.axvline(mean_writing, color='k', linestyle='dashed', linewidth=3, label='Mean')
plt.axvline(60, color='r', linestyle='dashed', linewidth=3, label='Acceptable Score')
# The legend uses the labels set on the two lines
plt.legend()
plt.show()
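
The three histogram cells repeat the same steps, so they could be folded into one helper. The sketch below is only an illustration: `plot_score_hist` and the `toy` series are hypothetical names that are not part of the original notebook, and the `Agg` backend line is there only so the example runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_score_hist(scores, subject, passing=60):
    """Histogram of one subject with mean and minimum-score lines (hypothetical helper)."""
    fig, ax = plt.subplots()
    ax.hist(scores, rwidth=0.9, edgecolor="k")
    # Labels set here are picked up by ax.legend() below.
    ax.axvline(scores.mean(), color="k", linestyle="dashed", linewidth=3, label="Mean")
    ax.axvline(passing, color="r", linestyle="dashed", linewidth=3, label="Acceptable Score")
    ax.set_xlabel("Score")
    ax.set_ylabel("Frequency")
    ax.set_title(f"Histogram {subject} Score")
    ax.legend()
    return ax


# Made-up scores; in the notebook this would be df['math_score'].
toy = pd.Series([40, 55, 62, 70, 71, 88, 93])
ax = plot_score_hist(toy, "Math")
```

In the notebook, the helper could then be called once per subject, for example `plot_score_hist(df['reading_score'], 'Reading')`.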

These histograms lead to the following conclusions:

* In all subjects, the majority of students scored above the acceptable minimum
* Students performed best in writing
* Students performed worst in math
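
Conclusions like these can be backed by numbers as well as by the plots. The sketch below computes, per subject, the mean and the share of scores at or above 60. The `scores` frame is made up for illustration, so its values say nothing about the real dataset.

```python
import pandas as pd

# Made-up scores for four students; the notebook would use the df columns.
scores = pd.DataFrame({
    "math_score": [50, 60, 70, 80],
    "reading_score": [55, 68, 72, 85],
    "writing_score": [65, 70, 75, 90],
})

summary = pd.DataFrame({
    "mean": scores.mean(),
    # Mean of a boolean column is the fraction of True values.
    "share_at_or_above_60": (scores >= 60).mean(),
})

# Subjects sorted from best to worst mean score.
print(summary.sort_values("mean", ascending=False))
```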

Counting the students in each grade.

In [None]:
df.grade.value_counts()

Evaluating graphically.

In [None]:
sns.countplot(x="grade", data = df, order=['A','B','C','D','E/F'],  palette="muted")
plt.title('Grade count')
plt.show()

We can see that most of the students had grades above E/F, but the single largest group is still the E/F group. We can also see that the number of students with grade A is very small compared with the other groups.
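
Both statements can hold at once: E/F can be the largest single group while still being a minority of all students. A minimal sketch with a made-up `grades` series (an assumption for illustration) shows how to get the shares with `value_counts(normalize=True)`.

```python
import pandas as pd

# Made-up grades for ten students; the notebook would use df['grade'].
grades = pd.Series(["A", "B", "C", "C", "D", "E/F", "E/F", "E/F", "B", "D"])
order = ["A", "B", "C", "D", "E/F"]

# Fraction of students in each grade, in grade order.
share = grades.value_counts(normalize=True).reindex(order)

# Fraction of students with any grade better than E/F.
share_above_ef = 1 - share["E/F"]

print(share)
print(share_above_ef)
```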

Analyzing the students in group A:

Ethnicity of students

In [None]:
sns.countplot(x='ethnicity', data = Top_students, palette="muted")
plt.title('Ethnicity of Top students')
plt.show()

Test preparation of students:

In [None]:
# Take the labels from value_counts() so they always match the wedge order
prep_counts = Top_students.test_preparation_course.value_counts()
plt.pie(prep_counts, labels=prep_counts.index, autopct='%1.1f%%', colors=['magenta', 'cyan'])
my_circle=plt.Circle( (0,0), 0.75, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.axis('equal')
plt.title('Preparation of Top students')
plt.show()

Lunch type:

In [None]:
# Take the labels from value_counts() so they always match the wedge order
lunch_counts = Top_students.lunch.value_counts()
plt.pie(lunch_counts, labels=lunch_counts.index, autopct='%1.1f%%')
my_circle=plt.Circle( (0,0), 0.8, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.axis('equal')
plt.title('Lunch of Top Students')
plt.show()

Parents' level of education:

In [None]:
p = sns.countplot(x='parents_education', data = Top_students, palette="muted")
plt.setp(p.get_xticklabels(), rotation=45)
plt.title('Parents Education of Top students')
plt.show()

Conclusions about students with grade A:

* The majority belong to ethnicity group C and the fewest to group E
* The majority of students did not take the test preparation course
* The majority of students have a standard lunch
* Most parents of these students studied beyond the high school level, with an associate's degree being the most common


Analyzing the students in group E/F:

Ethnicity of students

In [None]:
sns.countplot(x='ethnicity', data = Fail_students, palette="muted")
plt.title('Ethnicity of students with low grade')
plt.show()

Test preparation of students:

In [None]:
# Take the labels from value_counts() so they always match the wedge order
prep_counts = Fail_students.test_preparation_course.value_counts()
plt.pie(prep_counts, labels=prep_counts.index, autopct='%1.1f%%', labeldistance=1.1, colors=['magenta', 'cyan'])
my_circle=plt.Circle( (0,0), 0.79, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.axis('equal')
plt.title('Preparation of students with low grade')
plt.show()

Lunch type:

In [None]:
# Take the labels from value_counts() so they always match the wedge order
lunch_counts = Fail_students.lunch.value_counts()
plt.pie(lunch_counts, labels=lunch_counts.index, autopct='%1.1f%%')
my_circle=plt.Circle( (0,0), 0.75, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.axis('equal')
plt.title('Lunch of students with low grade')
plt.show()

Parents' level of education:

In [None]:
p = sns.countplot(x='parents_education', data = Fail_students, palette="muted")
plt.setp(p.get_xticklabels(), rotation=45)
plt.title('Parents Education of students with low grade')
plt.show()

Conclusions about students with grade E/F:

* The majority belong to ethnicity group C and the fewest to group E, similar to the students in group A
* The majority of students did not take the test preparation course
* The two lunch types are almost evenly split, with a small majority having a standard lunch
* Most parents of these students studied up to the high school level, whether or not they completed it
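
The two groups were compared above by looking at separate plots. A row-normalized `pd.crosstab` puts the same comparison in one table. The sketch below uses a made-up `toy` frame (an assumption for illustration), so the shares it prints are not results from the dataset.

```python
import pandas as pd

# Made-up students; the notebook would use the grade and lunch columns of df.
toy = pd.DataFrame({
    "grade": ["A", "A", "A", "A", "E/F", "E/F", "E/F", "E/F"],
    "lunch": ["standard", "standard", "standard", "free/reduced",
              "standard", "standard", "free/reduced", "free/reduced"],
})

# Keep only the two extreme groups.
extremes = toy[toy["grade"].isin(["A", "E/F"])]

# normalize="index" turns each row into shares that sum to 1.
lunch_share = pd.crosstab(extremes["grade"], extremes["lunch"], normalize="index")
print(lunch_share)
```

The same call works for `ethnicity`, `test_preparation_course`, or `parents_education` by swapping the second argument.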

In [None]:
df_f = df[df.gender == 'female']
df_m = df[df.gender == 'male']

In [None]:
df_f.gender[df_f.gender == 'male'].count()

In [None]:
df_m.gender[df_m.gender == 'female'].count()

The last analysis compares the genders to see which one performed better.

About Math:

In [None]:
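# histplot with a KDE overlay replaces the deprecated distplot (seaborn >= 0.11)
sns.histplot(df_m['math_score'], kde=True, stat='density', label='Male')
sns.histplot(df_f['math_score'], kde=True, stat='density', label='Female')
plt.title('Histogram of math score by gender')
plt.legend()
plt.show()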

About reading:

In [None]:
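# histplot with a KDE overlay replaces the deprecated distplot (seaborn >= 0.11)
sns.histplot(df_m['reading_score'], kde=True, stat='density', label='Male')
sns.histplot(df_f['reading_score'], kde=True, stat='density', label='Female')
plt.title('Histogram of reading score by gender')
plt.legend()
plt.show()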

About writing:

In [None]:
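# histplot with a KDE overlay replaces the deprecated distplot (seaborn >= 0.11)
sns.histplot(df_m['writing_score'], kde=True, stat='density', label='Male')
sns.histplot(df_f['writing_score'], kde=True, stat='density', label='Female')
plt.title('Histogram of writing score by gender')
plt.legend()
plt.show()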

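Comparison of the average scores by gender:

In [None]:
# histplot with a KDE overlay replaces the deprecated distplot (seaborn >= 0.11)
sns.histplot(df_m['mean_score'], kde=True, stat='density', label='Male')
sns.histplot(df_f['mean_score'], kde=True, stat='density', label='Female')
plt.title('Histogram of mean score by gender')
plt.legend()
plt.show()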

Conclusions from these histograms:

* The scores in all subjects are practically the same for both genders, but females perform slightly better overall
* Females perform better in writing and reading
* Males perform better in math
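
Overlapping histograms make small differences hard to read, so a table of group means is a useful complement. This is a minimal sketch with a made-up `toy` frame (an assumption for illustration). The numbers it prints are not results from the dataset.

```python
import pandas as pd

# Made-up scores for two female and two male students.
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "math_score": [60, 70, 72, 68],
    "reading_score": [80, 76, 66, 70],
    "writing_score": [82, 74, 62, 68],
})

score_cols = ["math_score", "reading_score", "writing_score"]

# Mean score per subject for each gender.
by_gender = toy.groupby("gender")[score_cols].mean()

# Positive values mean females scored higher on average in that subject.
diff = by_gender.loc["female"] - by_gender.loc["male"]

print(by_gender)
print(diff)
```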

### Final conclusions

In general, most students had good scores in all exams, and the score distributions are similar across subjects. Math was the subject with the worst performance and writing the one with the best. In the analysis by gender, performance is practically the same for males and females, with females slightly ahead overall. Females did better in writing and reading, while males did better in math, but the differences within each subject were small.

After converting the average scores into grades, the students were separated by grade. The best students had grade A and the worst had grade E/F. Although the majority of students had grades above E/F, E/F was the single largest group.

Comparing the characteristics of students in groups A and E/F, ethnicity was not relevant, since both groups showed the same pattern: most students belong to ethnicity group C and the fewest to group E. The same happened with test preparation: in both groups, most students did not take the preparation course.

The features that differed were lunch and the parents' level of education. Most students in group A had a standard lunch, whereas group E/F was almost evenly split, with a small majority having a standard lunch. Most parents of group A students studied beyond high school, unlike the parents of group E/F students, most of whom studied up to the high school level, whether or not they completed it.

