# Exploratory Data Analysis (EDA) with Python for Beginners

This notebook walks through an EDA of the StudentsPerformance dataset in Jupyter Notebook (the latest version as of October 3, 2020). You can download the dataset from [here](https://www.kaggle.com/spscientist/students-performance-in-exams). I use Anaconda Navigator, which lets you launch applications such as Jupyter Notebook, Spyder, PyCharm, RStudio and many more (you can read more about Anaconda Navigator [here](https://docs.anaconda.com/anaconda/navigator/)). Download Anaconda Navigator from [here](https://www.anaconda.com/products/individual).

**If you find this notebook helpful please upvote.**

### Importing required libraries

You can read about them in detail at the links below.
* [pandas](http://www.geeksforgeeks.org/python-data-analysis-using-pandas/)
* [numpy](https://www.tutorialspoint.com/numpy/index.htm)
* [matplotlib](https://www.datacamp.com/community/tutorials/matplotlib-tutorial-python)
* [seaborn](https://seaborn.pydata.org/)

Most of these sites cover all the popular Python libraries, so you can pick the site whose explanations you find clearest and look up other libraries there.

In [None]:

import pandas as pd    # for data analysis
import numpy as np     # for working with arrays
import matplotlib.pyplot as plt  # for visualization
import seaborn as sns            # built on matplotlib, provides more visualization features
%matplotlib inline 

import os

%matplotlib inline is used to embed the plots in the notebook.

# Loading and preprocessing the dataset

Now let's read the CSV file and store it in a DataFrame.

In [None]:
StudentsPerformance = pd.read_csv("../input/students-performance-in-exams/StudentsPerformance.csv")

In [None]:
StudentsPerformance #let's display our dataset

Our dataset has 8 attributes and 1000 records.

In [None]:
StudentsPerformance.columns  #displaying columns

In [None]:
StudentsPerformance.head(5) #displaying the top 5 rows

In [None]:
StudentsPerformance.describe() #providing the statistical overview of all the numerical columns in our dataset

count: number of non-null values in each column

mean: average of the values in each column

std: standard deviation of the values in each column

min: minimum value in each column

25%: the 25th percentile, i.e. 25% of the values are less than or equal to this value (for math score, 25% of the values are less than or equal to 57.0)

50%: the 50th percentile (the median), i.e. 50% of the values are less than or equal to this value

75%: the 75th percentile, i.e. 75% of the values are less than or equal to this value

max: maximum value in each column
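
To see where these rows come from, here is a minimal sketch on a made-up five-value Series (the numbers and the name `scores` are illustrative and are not taken from the dataset). It compares the percentile rows of describe with the quantile function called directly.

```python
import pandas as pd

# Made-up scores, for illustration only (not from StudentsPerformance.csv)
scores = pd.Series([40, 50, 60, 70, 80], name="toy score")

stats = scores.describe()
print(stats)

# Each percentile row matches Series.quantile with the same fraction
q25 = scores.quantile(0.25)
q50 = scores.quantile(0.50)   # the median
q75 = scores.quantile(0.75)
print(q25, q50, q75)
```

With evenly spaced values like these, the median and the mean coincide; on real data they usually differ a little.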

##### We can also use the describe function on an individual column

In [None]:
StudentsPerformance['race/ethnicity'].describe() #describing race/ethnicity column

In [None]:
StudentsPerformance.sample(5) #five random rows from the dataset, helpful to get an idea about our dataset

In [None]:
StudentsPerformance.isnull().sum() #counting no of null/missing values

As no null values are present, we can analyse the data now. If null values were present, we would have to deal with them first, for example as described [here](https://www.geeksforgeeks.org/working-with-missing-data-in-pandas/).
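
For reference, here is a minimal sketch of two common ways to handle missing values. The frame `toy` and its values are hypothetical, since our dataset has nothing missing; which option is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing values (hypothetical data)
toy = pd.DataFrame({
    "math score": [70.0, np.nan, 55.0, 90.0],
    "lunch": ["standard", "free/reduced", None, "standard"],
})

print(toy.isnull().sum())   # missing values per column

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill the gaps instead of dropping rows
filled = toy.copy()
# numeric column: fill with the median of the known values
filled["math score"] = filled["math score"].fillna(filled["math score"].median())
# categorical column: fill with the most frequent value
filled["lunch"] = filled["lunch"].fillna(filled["lunch"].mode()[0])
print(filled)
```

Dropping rows is simplest but loses data; filling keeps every row at the cost of introducing estimated values.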

In [None]:
StudentsPerformance.info() #information about count of values present and their datatype

In [None]:
#information about data shape
print("No. of rows: {}".format(StudentsPerformance.shape[0]))
print("No. of columns: {}".format(StudentsPerformance.shape[1]))

### Calculating percentage

We can make it easier to understand the performance of students by adding another column, percentage, which holds the average of the three scores.

In [None]:
StudentsPerformance['percentage'] = (StudentsPerformance['math score']+StudentsPerformance['reading score'] + StudentsPerformance['writing score'])/3
StudentsPerformance['percentage']=StudentsPerformance.percentage.round(decimals=2) #rounding the values in percentage to two decimal places
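
The same column can also be computed with a row-wise mean, which avoids typing the sum by hand. This is a minimal sketch on a made-up three-row frame; `toy` and `score_cols` are names introduced here for illustration. Note that round goes to the nearest value at two decimal places rather than always rounding up.

```python
import pandas as pd

# Toy rows using the dataset's column names (the values are made up)
toy = pd.DataFrame({
    "math score": [72, 69, 90],
    "reading score": [72, 90, 95],
    "writing score": [74, 88, 93],
})

score_cols = ["math score", "reading score", "writing score"]

# axis=1 averages across the columns of each row
toy["percentage"] = toy[score_cols].mean(axis=1).round(2)
print(toy)
```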

In [None]:
StudentsPerformance.sample(10) #viewing ten random rows

In [None]:
StudentsPerformance['percentage'].describe() #describing percentage column

### Top 10 scorers

We can sort the rows by a column in descending order using the sort_values function. Applying head(10) then keeps the top ten rows, which we store in TopTenPrcntgs.

In [None]:
TopTenPrcntgs = StudentsPerformance.sort_values('percentage',ascending=False).head(10)
TopTenPrcntgs #displaying ten rows with highest percentage

### Lowest 10 scorers

Similarly, if we sort the rows in ascending order of percentage, the first ten rows will have the lowest percentages.

In [None]:
BottomTenPrcntgs = StudentsPerformance.sort_values('percentage', ascending=True).head(10)
BottomTenPrcntgs #displaying ten rows with lowest percentage
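
As an alternative to sorting and then taking head, pandas also provides nlargest and nsmallest. Here is a minimal sketch on a made-up frame (the `student` labels and the percentages are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "student": ["a", "b", "c", "d", "e"],   # hypothetical labels
    "percentage": [67.5, 91.0, 48.33, 88.0, 73.0],
})

# Two highest and two lowest rows by percentage
top_two = toy.nlargest(2, "percentage")
bottom_two = toy.nsmallest(2, "percentage")

print(top_two)
print(bottom_two)
```

Both calls return the rows already ordered, highest first for nlargest and lowest first for nsmallest.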

In [None]:
StudentsPerformance.gender.value_counts()     # no. of male and female students

In [None]:
StudentsPerformance['race/ethnicity'].value_counts()  # No. of people belonging to each race/ethnicity

# Statistical Data Analysis 

Now let's perform some statistical operations on the dataset. 

In [None]:
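StudentsPerformance.mean(numeric_only=True) #calculating the mean of each numeric column (numeric_only=True skips the text columns, which recent pandas versions refuse to average)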

We can see that the average math score is 66.09, the average reading score is 69.17, the average writing score is 68.05, and the average percentage is 67.77. So, on average, the students performed best in the reading test and worst in the math test, which suggests that the school needs to pay more attention to math.

### Using groupby for storing mean of marks w.r.t. various attributes

Let's understand groupby with the example below. Here we use groupby on the race/ethnicity column, which has five unique values: group A, group B, group C, group D and group E. Rows that have the same value in race/ethnicity are grouped together (all rows where race/ethnicity is group A form one group, and so on for the other unique values). An aggregate such as the mean is then computed separately for each group.
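
To make the grouping concrete, here is a minimal sketch on a made-up five-row frame (the scores are hypothetical). It first prints which rows land in each group, then computes the mean per group.

```python
import pandas as pd

toy = pd.DataFrame({
    "race/ethnicity": ["group A", "group B", "group A", "group B", "group C"],
    "math score": [50, 70, 60, 80, 90],   # made-up scores
})

# Which row labels fall into each group
for name, rows in toy.groupby("race/ethnicity"):
    print(name, rows.index.tolist())

# One mean per unique value of race/ethnicity
group_means = toy.groupby("race/ethnicity")["math score"].mean()
print(group_means)
```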

#### We use groupby on the race/ethnicity column, store the mean of the scores and percentage, and display them in descending order

In [None]:
group_details = StudentsPerformance.groupby('race/ethnicity')[['math score', 'reading score', 'writing score', 'percentage']].mean()
group_details.sort_values('percentage',ascending=False) 

#### Storing the mean of marks and percentage w.r.t. gender and displaying them in descending order

In [None]:
gender_details = StudentsPerformance.groupby('gender')[['math score','reading score', 'writing score','percentage']].mean()
gender_details.sort_values('percentage',ascending=False)

#### Storing the mean of marks and percentage w.r.t. parental level of education and displaying them in descending order

In [None]:
parental_lvl_of_edu_details = StudentsPerformance.groupby('parental level of education')[['math score','writing score','reading score','percentage']].mean()
parental_lvl_of_edu_details.sort_values('percentage',ascending=False)

#### Storing the mean of marks and percentage w.r.t. lunch and displaying them in descending order

In [None]:
lunch_details = StudentsPerformance.groupby('lunch')[['math score','writing score','reading score','percentage']].mean()
lunch_details.sort_values('percentage',ascending=False)

## Visualizing Data

Visualizing data means representing it in the form of graphs. We usually do this to discover new information or patterns, because a graphical representation is easier to understand than a table of numbers. Data is often too big to inspect directly: a CSV file may contain hundreds or even thousands of records, so inferring anything just by looking at the raw data might not be possible. Operations such as data visualization help us find hidden patterns and information.

### Now let's plot a few example graphs

In [None]:
sns.set_style("whitegrid") #set theme to whitegrid

A scatter plot with math score on the x-axis and writing score on the y-axis, with the points colored by the type of lunch the students have. Read more about the [scatter plot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) here.

In [None]:
# alpha sets the opacity and s sets the size of the dots; you can change them to see the effect
sns.scatterplot(x='math score', y='writing score', data=StudentsPerformance, hue='lunch', alpha=0.6, s=50);
plt.title("Scatter plot example"); #giving a title to the graph
#we don't need to provide labels for the x and y axes; seaborn takes them from the column names
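
A scatter plot shows the relationship visually; a correlation coefficient puts a number on it. Here is a minimal sketch using the corr method on a made-up frame (the scores are hypothetical, so the result says nothing about the real dataset):

```python
import pandas as pd

# Made-up scores that rise together, for illustration only
toy = pd.DataFrame({
    "math score": [40, 55, 62, 70, 88],
    "writing score": [45, 52, 66, 71, 90],
})

# Series.corr computes the Pearson correlation by default
r = toy["math score"].corr(toy["writing score"])
print(round(r, 3))
```

A value close to 1 means the two scores tend to increase together, a value close to -1 means one falls as the other rises, and a value near 0 means there is no linear relationship.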

In [None]:
#using matplotlib to draw a scatter plot
fig, ax=plt.subplots() # plt.subplots() returns a tuple containing the figure and axes objects, which are stored in fig and ax

ax.scatter(StudentsPerformance['percentage'],StudentsPerformance['race/ethnicity']);
plt.xlabel('Percentage Scored');
plt.ylabel('Race/Ethnicity');

#### Plotting two histograms together to compare their distributions

In [None]:
plt.hist(StudentsPerformance['reading score'],alpha=0.4);
plt.hist(StudentsPerformance.percentage,alpha=0.4);
plt.legend(['reading score','percentage']);

#### Plotting a simple bar graph showing how many parents have each level of education

In [None]:
StudentsPerformance['parental level of education'].value_counts().plot(kind='bar');

#### Pie plot

In [None]:
StudentsPerformance['race/ethnicity'].value_counts().plot(kind='pie');

### Pair plot with seaborn

Seaborn can automatically plot the pairwise relationships between the numeric columns of a dataset.

In [None]:
sns.pairplot(StudentsPerformance, hue='gender');

In [None]:
sns.pairplot(StudentsPerformance,hue='lunch');

# Exploratory Data Analysis (EDA)

### 1. What is the range of scores for the majority of students?

#### Plotting stacked histogram

In [None]:
plt.hist([StudentsPerformance['reading score'],StudentsPerformance['writing score'],StudentsPerformance['math score']], stacked=True);
plt.legend(['reading score','writing score','math score']);

A stacked histogram puts each new segment on top of the previous one, so the total height of each bar shows how many scores from all three tests fall in that range. We can infer that most students scored between 60 and 80 marks in their tests.
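
A reading like "most scores fall between 60 and 80" can also be checked numerically. Here is a minimal sketch on a made-up frame (the scores are hypothetical, so the share printed here is not the share in the real dataset):

```python
import pandas as pd

# Made-up scores using the dataset's column names
toy = pd.DataFrame({
    "reading score": [55, 62, 71, 79, 85],
    "writing score": [58, 65, 70, 81, 90],
    "math score": [45, 60, 68, 77, 95],
})

# Put all three score columns into one long Series
all_scores = toy.melt(value_name="score")["score"]

# between() is inclusive on both ends by default
in_range = all_scores.between(60, 80)
share = in_range.mean()   # fraction of True values

print(f"{in_range.sum()} of {len(all_scores)} scores are in 60-80 ({share:.0%})")
```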

### 2. From which group do students score better?

In [None]:
sns.barplot(x='race/ethnicity',y='percentage', data=StudentsPerformance);

We can see that students from group E scored the highest on average, followed by group D, group C, group B and group A.

### 3. Which gender performed better overall?

In [None]:
sns.barplot(x='gender',y='percentage', data=StudentsPerformance);

Girls performed better than boys overall.

### 4. Does lunch impact students' performance?

In [None]:
sns.barplot(x='lunch',y='percentage',data=StudentsPerformance);

Yes, students who were provided standard lunch performed better on average than students with free/reduced lunch.

### 5. Do students perform better if they complete their test preparation?

In [None]:
sns.barplot(x='test preparation course',y='percentage',data=StudentsPerformance);

Yes, students who completed their test preparation course performed better on average than students who did not complete it.

### 6. Does parents' education impact their child's performance?

In [None]:
plt.figure(figsize=(10,4)); #define figure size
sns.barplot(x='parental level of education',y='percentage',data=StudentsPerformance);

As we can see, the average percentage differs across the parental education levels, which suggests that parents' education does impact students' performance.
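
The bar heights are group means, so a small summary table gives the exact numbers behind the chart, together with the spread and size of each group. Here is a minimal sketch on made-up data (the percentages are hypothetical, and `summary` and `bar_order` are names introduced here):

```python
import pandas as pd

# Made-up rows using the dataset's column names
toy = pd.DataFrame({
    "parental level of education": [
        "high school", "master's degree", "high school",
        "master's degree", "some college", "some college",
    ],
    "percentage": [60.0, 75.0, 64.0, 71.0, 68.0, 70.0],
})

# Mean, standard deviation and group size per education level
summary = (
    toy.groupby("parental level of education")["percentage"]
       .agg(["mean", "std", "count"])
       .sort_values("mean", ascending=False)
)
print(summary)

# Category labels from highest to lowest mean
bar_order = summary.index.tolist()
print(bar_order)
```

A list like `bar_order` can be passed to the `order` parameter of sns.barplot so that the bars appear from highest to lowest mean. Looking at the count column also helps judge how much weight to give a difference between small groups.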

## Summary

In the future, we could compare data from different countries to see how students perform in developed and developing countries, whether there is any difference, and, if so, how performance can be improved.

From the StudentsPerformance dataset we can see that the performance of students depends on various factors, such as the attributes examined above. We could clearly see a correlation between students' performance and the amenities provided to them.