**Problem Statement**: Determine whether a correlation exists between the categorical attributes in the dataset and the students' test scores.

For example:

* Gender and reading score
* Race and math score
* Lunch and writing score

#### Importing Essential libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#### Loading the dataset


In [None]:
data = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')

### Exploring the dataset


In [None]:
# head(n) returns the first n rows; with no argument it returns the first 5
data.head()

In [None]:
# Returns number of rows and columns of the dataset
data.shape

In [None]:
# Returns an Index object containing all of the column labels
data.columns

In [None]:
# Returns the number of null values present in each column
data.isnull().sum()

In [None]:
# Returns the datatype of each column (float, int, object, bool, etc.)
data.dtypes

In [None]:
# Returns basic statistics on numeric columns
data.describe()

In [None]:
# For each column with the object datatype, print the number of unique values and the values themselves
for column in data.columns:
    if data[column].dtypes == 'O':
        print(f"For {column}, Number of Unique values: {data[column].nunique()} and the Unique values are: {data[column].unique()}")

In [None]:
# Increase the figure resolution so the plots are easier to read
plt.rcParams['figure.dpi'] = 200

### Figuring out the correlation of 'gender' with the math score, reading score and writing score

In [None]:
# The number of 'male' and 'female' entries in the 'gender' column
data.gender.value_counts()

In [None]:
# Plot a histogram of a numerical variable split by a categorical feature, then print the group means.
def print_plot_and_display_means(numerical_variable, data, categorical_variable):
    """Plot the distribution of a test score for each category and print the mean score per category."""
    sns.histplot(data=data, x=numerical_variable, hue=categorical_variable, multiple="dodge")
    plt.show()  # render the histogram before the means are printed
    print(data.groupby(categorical_variable)[numerical_variable].mean())

In [None]:
# Plotting histogram of 'math score' and checking if there is any correlation with 'gender'
print_plot_and_display_means('math score', data, 'gender')

**Conclusion: On average, male students (68.72) score about 5 points higher than female students (63.63) in the math section.**

In [None]:
# Plotting histogram of 'reading score' and checking if there is any correlation with 'gender'   
print_plot_and_display_means('reading score', data, 'gender')

**Conclusion: On average, female students (72.6) score about 7 points higher than male students (65.47) in the reading section.**

In [None]:
# Plotting histogram of 'writing score' and checking if there is any correlation with 'gender'   
print_plot_and_display_means('writing score', data, 'gender')

**Conclusion: On average, female students (72.46) score about 9 points higher than male students (63.31) in the writing section.**
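The three gender conclusions each subtract one group mean from another. As a minimal sketch, the same arithmetic can be done for all three scores in one table. The helper name `mean_gap` and the constant `SCORE_COLUMNS` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Assumed list of score columns, taken from the column names used in this notebook.
SCORE_COLUMNS = ["math score", "reading score", "writing score"]

def mean_gap(df, group_col, group_a, group_b, score_cols=SCORE_COLUMNS):
    """Return the mean of each score for two groups plus the gap (group_a minus group_b)."""
    means = df.groupby(group_col)[list(score_cols)].mean()
    # Transpose so that each row is a score and each column is a group.
    table = means.loc[[group_a, group_b]].T
    table["gap"] = table[group_a] - table[group_b]
    return table.round(2)
```

Calling `mean_gap(data, "gender", "female", "male")` on the loaded dataset should give one row per score, with a negative gap where male students score higher and a positive gap where female students score higher.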

### Figuring out the correlation of 'race/ethnicity' with the math score, reading score and writing score

In [None]:
# the number of different ethnicities in the 'race/ethnicity' column
data["race/ethnicity"].value_counts()

In [None]:
# Plotting histogram of 'math score' and checking if there is any correlation with 'race/ethnicity'
print_plot_and_display_means('math score', data, 'race/ethnicity')

**Conclusion: Group E (73.82) scores higher on average than the other groups in the math section.**

In [None]:
# Plotting histogram of 'reading score' and checking if there is any correlation with 'race/ethnicity'   
print_plot_and_display_means('reading score', data, 'race/ethnicity')

**Conclusion: Group E (73.02) scores higher on average than the other groups in the reading section.**

In [None]:
# Plotting histogram of 'writing score' and checking if there is any correlation with 'race/ethnicity'   
print_plot_and_display_means('writing score', data, 'race/ethnicity')

**Conclusion: Group D (70.14) and Group E (71.4) score higher on average than the other groups in the writing section.**
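With five groups, picking the highest mean by eye from the printed output is error-prone, and the groups differ in size. The sketch below sorts the group means from highest to lowest and shows the group size next to each mean. The helper name `rank_groups` is an assumption introduced for illustration.

```python
import pandas as pd

def rank_groups(df, group_col, score_col):
    """Return the mean score and the group size per group, highest mean first."""
    summary = df.groupby(group_col)[score_col].agg(["mean", "count"])
    return summary.sort_values("mean", ascending=False)
```

For example, `rank_groups(data, "race/ethnicity", "writing score")` would list the groups in descending order of mean writing score, which makes it easier to see how close the top groups are to each other.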

### Figuring out the correlation of 'parental level of education' with the math score, reading score and writing score

In [None]:
# the number of different education levels in the "parental level of education" column
data["parental level of education"].value_counts()

In [None]:
# Plotting histogram of 'math score' and checking if there is any correlation with 'parental level of education'
print_plot_and_display_means('math score', data, 'parental level of education')

**Conclusion: Students whose parents have some college education, an associate's degree, a bachelor's degree, or a master's degree score higher on average in the math test than students whose parents have a high school or some high school level of education.**

In [None]:
# Plotting histogram of 'reading score' and checking if there is any correlation with 'parental level of education'   
print_plot_and_display_means('reading score', data, 'parental level of education')

**Conclusion: Students whose parents have some college education, an associate's degree, a bachelor's degree, or a master's degree score higher on average in the reading test than students whose parents have a high school or some high school level of education.**

In [None]:
# Plotting histogram of 'writing score' and checking if there is any correlation with 'parental level of education'   
print_plot_and_display_means('writing score', data, 'parental level of education')

**Conclusion: Students whose parents have some college education, an associate's degree, a bachelor's degree, or a master's degree score higher on average in the writing test than students whose parents have a high school or some high school level of education.**
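By default, `groupby` sorts string keys alphabetically, so the printed means do not follow the education levels from lowest to highest. The sketch below groups by an ordered categorical instead. The ordering in `EDUCATION_ORDER` is an assumption about how the levels rank, and the helper name `means_by_education` is hypothetical.

```python
import pandas as pd

# Assumed ranking of the education levels, lowest to highest (not stated in the dataset).
EDUCATION_ORDER = [
    "some high school",
    "high school",
    "some college",
    "associate's degree",
    "bachelor's degree",
    "master's degree",
]

def means_by_education(df, score_col, col="parental level of education"):
    """Return the mean score per education level, in the assumed education order."""
    dtype = pd.CategoricalDtype(categories=EDUCATION_ORDER, ordered=True)
    levels = df[col].astype(dtype)
    # observed=False keeps every level in the result, even one with no students.
    return df[score_col].groupby(levels, observed=False).mean()
```

Calling `means_by_education(data, "math score")` would then show whether the mean rises steadily with the education level or only differs between the two high school levels and the rest.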

### Figuring out the correlation of 'lunch' with the math score, reading score and writing score

In [None]:
# the number of different lunch options in the 'lunch' column
data["lunch"].value_counts()

In [None]:
# Plotting histogram of 'math score' and checking if there is any correlation with 'lunch'
print_plot_and_display_means('math score', data, 'lunch')

**Conclusion: Students who opted for the 'standard' lunch option score higher on average in the math test than students who opted for the 'free/reduced' lunch option.**

In [None]:
# Plotting histogram of 'reading score' and checking if there is any correlation with 'lunch'   
print_plot_and_display_means('reading score', data, 'lunch')

**Conclusion: Students who opted for the 'standard' lunch option score higher on average in the reading test than students who opted for the 'free/reduced' lunch option.**

In [None]:
# Plotting histogram of 'writing score' and checking if there is any correlation with 'lunch'   
print_plot_and_display_means('writing score', data, 'lunch')

**Conclusion: Students who opted for the 'standard' lunch option score higher on average in the writing test than students who opted for the 'free/reduced' lunch option.**
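The lunch conclusions say which group scores higher but not by how much relative to the spread of the scores. One common way to express that is Cohen's d, the difference in means divided by the pooled standard deviation. This is a swapped-in technique, not something the original notebook computes, and the helper name `cohens_d` is an assumption.

```python
import numpy as np
import pandas as pd

def cohens_d(df, group_col, group_a, group_b, score_col):
    """Return Cohen's d for group_a versus group_b using the pooled standard deviation."""
    a = df.loc[df[group_col] == group_a, score_col]
    b = df.loc[df[group_col] == group_b, score_col]
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)
```

A call such as `cohens_d(data, "lunch", "standard", "free/reduced", "math score")` returns a positive value when the 'standard' group has the higher mean. Both groups need at least two students for the sample variance to be defined.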

### Figuring out the correlation of 'test preparation course' with the math score, reading score and writing score

In [None]:
# the number of different test preparation options in the 'test preparation course' column
data["test preparation course"].value_counts()

In [None]:
# Plotting histogram of 'math score' and checking if there is any correlation with 'test preparation course'
print_plot_and_display_means('math score', data, 'test preparation course')

**Conclusion: Students who "completed" the "test preparation course" score higher on average in the math section than students with "none" (no test preparation).**

In [None]:
# Plotting histogram of 'reading score' and checking if there is any correlation with 'test preparation course'   
print_plot_and_display_means('reading score', data, 'test preparation course')

**Conclusion: Students who "completed" the "test preparation course" score higher on average in the reading section than students with "none" (no test preparation).**

In [None]:
# Plotting histogram of 'writing score' and checking if there is any correlation with 'test preparation course'   
print_plot_and_display_means('writing score', data, 'test preparation course')

**Conclusion: Students who "completed" the "test preparation course" score higher on average in the writing section than students with "none" (no test preparation).**
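Comparing means shows the direction of the difference, but the problem statement asks about correlation. For a two-valued attribute such as 'test preparation course', one option is the point-biserial correlation, which is the Pearson correlation between a 0/1 indicator and the score. The sketch below is one possible way to compute it with pandas alone; the helper name `point_biserial` is an assumption.

```python
import pandas as pd

def point_biserial(df, group_col, positive_label, score_col):
    """Return the Pearson correlation between a 0/1 group indicator and a score."""
    # 1 where the row belongs to positive_label, 0 otherwise.
    indicator = (df[group_col] == positive_label).astype(int)
    return indicator.corr(df[score_col])
```

For instance, `point_biserial(data, "test preparation course", "completed", "writing score")` yields a value between -1 and 1, where a positive value means students who completed the course tend to score higher.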

### Final Conclusions about the tests
* Female students score higher on average than male students in reading and writing, while male students score higher in math.
* Group E performs better on average than the other groups.
* Students whose parents have some college education, an associate's degree, a bachelor's degree, or a master's degree score higher on average than students whose parents have a high school or some high school level of education.
* Students who opted for the 'standard' lunch option score higher on average than students who opted for 'free/reduced'.
* Students who "completed" the "test preparation course" score higher on average than students with "none" (no test preparation).
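The conclusions above can be checked against a single table of group means for every categorical attribute and every score. The following is a minimal sketch; the helper name `summarize_group_means` and the `attribute` and `group` index names are assumptions introduced here.

```python
import pandas as pd

def summarize_group_means(df, category_cols, score_cols):
    """Return one table of mean scores indexed by (attribute, group)."""
    frames = []
    for col in category_cols:
        means = df.groupby(col)[list(score_cols)].mean()
        # Prefix each group label with the attribute it belongs to.
        means.index = pd.MultiIndex.from_product(
            [[col], means.index], names=["attribute", "group"]
        )
        frames.append(means)
    return pd.concat(frames).round(2)
```

With the notebook's dataset this could be called as `summarize_group_means(data, ["gender", "race/ethnicity", "parental level of education", "lunch", "test preparation course"], ["math score", "reading score", "writing score"])`.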