# <b>Exploratory Data Analysis</b>
##### • EDA is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. <a href='https://en.wikipedia.org/wiki/Exploratory_data_analysis'>Wikipedia</a>

##### • We will use the "Student Performance in Exams" dataset from Kaggle. <a href='https://www.kaggle.com/spscientist/students-performance-in-exams'>Data source </a>

<br>

#### <b>The main purposes of this analysis are:</b>
##### <i>1. To better understand and get familiar with the data.</i>
##### <i>2. To identify and clean any defects in the data, such as missing values and outliers.</i>
##### <i>3. To identify any patterns, trends, and interesting facts that lie within the data.</i>
- Which variables contribute to students' test scores?
- Is the test prep course effective?
- etc.


## <b>First glance at the data</b>

In [None]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#Get the data and look at the first 5 rows
df = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')
df.head()

In [None]:
#Get the shape of data
print(f'Shape of data: {df.shape}')
print(f'Number of rows: {df.shape[0]}')
print(f'Number of columns: {df.shape[1]}')
print(f'Number of dimensions: {df.ndim}')

In [None]:
#column names
print(f'Column names are: {df.columns}')

<b>• Some column names include a space or '/', which I prefer to avoid in column names, so I'll replace both with '_'.</b>

In [None]:
#replace ' ' and '/' with '_'
df.columns = df.columns.str.replace(' ', '_').str.replace('/', '_')
df.columns
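
The two chained replacements can also be done in one pass with a regex character class. This is a minimal sketch on a toy frame: the original column names below are assumed from the cleaned names used later in this notebook, and `toy` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Toy frame with no rows; only the column names matter here (assumed originals).
toy = pd.DataFrame(columns=['race/ethnicity', 'test preparation course', 'math score'])

# The character class [ /] matches either a space or a slash.
toy.columns = toy.columns.str.replace(r'[ /]', '_', regex=True)

print(list(toy.columns))
```

Passing `regex=True` explicitly keeps the behavior the same across pandas versions, since the default for this argument has changed over time.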

## <b>Types of Data</b>

In [None]:
#Look at dtypes
print(df.dtypes)
print('\n')
print(df.dtypes.value_counts())

In [None]:
#only show numerical dtypes
print(df.select_dtypes(include='number').head())

print('\n')

#only show non-numerical dtypes
print(df.select_dtypes(include='object').head())

In [None]:
#Look at the dtypes, missing values as well as memory usage at the same time.
df.info()

<b>- Columns with dtype object can probably be converted to dtype category, which is more memory-efficient.</b>

<b>- It looks like there are no missing values, but we want to make sure that's true.</b>

## <b>Memory Usage</b>

<b>• As we saw when looking at the dtypes, there were only 'object' and 'int64'.</b>
<br>
<b>• We will check whether we really need them to be 'object'/'int64'.</b>

In [None]:
df.info()

In [None]:
#See memory usage for each column
df.memory_usage(deep=True)

<b>- Clearly, dtype 'object' takes up a lot of memory. We may convert it to 'category' if the column doesn't contain many unique values. </b>

### • Let's first look at non-numerical values.

In [None]:
#Let's look at the value counts for each.
for col in df.select_dtypes(include='object').columns:
  print(f'---Value counts for {col}---\n {df[col].value_counts()}. \n\n')

In [None]:
#You can simply get the number of unique values per column by calling this
df.select_dtypes(include='object').nunique()

<b>- These look more like categorical data than plain object (string) data, with only a handful of unique values each. </b><br>
<b>- So we'll change the dtype to category to make memory usage more efficient.</b>
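
Why the conversion saves memory: a categorical column stores each distinct string once, plus one small integer code per row. Below is a rough sketch on a made-up column; the two values and the 1,000-row length are assumptions for illustration, not taken from the dataset.

```python
import pandas as pd

# Hypothetical column: 1,000 strings drawn from only two distinct values.
s_obj = pd.Series(['standard', 'free/reduced'] * 500, dtype='object')
s_cat = s_obj.astype('category')

# The distinct strings are stored once; each row only holds a small integer code.
print(s_cat.cat.categories.tolist())
print(s_cat.cat.codes.dtype)

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(f'object: {obj_bytes} bytes, category: {cat_bytes} bytes')
```

The saving shrinks as the number of distinct values grows, which is why checking `nunique()` first is a sensible step.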

In [None]:
#check current memory usage
df.select_dtypes(include='object').memory_usage(deep=True)

In [None]:
#look at how efficient it'd be when converted to category dtype
df.select_dtypes(include='object').astype('category').memory_usage(deep=True)

In [None]:
#convert object to categorical
non_numerical_columns = df.select_dtypes(include='object').columns
for col in non_numerical_columns:
  df[col] = df[col].astype('category')
#.info() prints its report itself and returns None, so don't wrap it in print()
df.select_dtypes(include='category').info()

In [None]:
df.info()

### • Now look at numerical values

<b>Memory usage and value range for each int dtype</b><br>
- int8: 1 byte (-128 to 127)<br>
- int16: 2 bytes (-32768 to 32767)<br>
- int32: 4 bytes (-2147483648 to 2147483647)<br>
- int64: 8 bytes (-9223372036854775808 to 9223372036854775807)<br>
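
The ranges listed above can be read directly from NumPy instead of being memorized. The sketch below prints them and adds a small helper, `smallest_int_dtype`, which is a hypothetical name introduced here for illustration and not part of NumPy or pandas.

```python
import numpy as np

# Collect the representable range of each signed integer dtype.
ranges = {}
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    name = np.dtype(dtype).name
    ranges[name] = (int(info.min), int(info.max))
    print(f'{name}: {info.min} to {info.max} ({np.dtype(dtype).itemsize} byte(s) per value)')

def smallest_int_dtype(low, high):
    """Return the name of the narrowest signed dtype that covers [low, high]."""
    for name, (dtype_min, dtype_max) in ranges.items():
        if dtype_min <= low and high <= dtype_max:
            return name
    raise ValueError('range does not fit in int64')

print(smallest_int_dtype(0, 100))
```

For test scores between 0 and 100, the helper picks `int8`, which matches the choice made below.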

In [None]:
#look at numerical values again
df.select_dtypes(include='number').head()

<b>- The numerical values appear to fall between 0 and 100, since they are all test scores.</b><br>
<b>- Let's look at the min and max values of each numerical column to be sure.</b>

In [None]:
print(f"- Min scores:\n{df.select_dtypes(include='number').min()}")
print('\n')
print(f"- Max scores:\n{df.select_dtypes(include='number').max()}")

In [None]:
#compare memory usage when using int64 vs int8
print(f"{df.select_dtypes(include='number').memory_usage()}\n") #int64
print(f"{df.select_dtypes(include='number').astype('int8').memory_usage()}\n") #int8

<b>- int8 uses far less memory and still covers the range of our scores.</b>

In [None]:
#Convert int64 to int8
numerical_columns = df.select_dtypes(include='number').columns
for col in numerical_columns:
  df[col] = df[col].astype('int8')
#.info() prints its report itself and returns None, so don't wrap it in print()
df.select_dtypes(include='number').info()

<b>- Memory usage for the numerical columns is now 1/8 of what it was! </b>
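
The factor of 8 follows from item size: int64 uses 8 bytes per value and int8 uses 1. One caveat, added here as an aside and not part of the original analysis: int8 arithmetic wraps around silently beyond 127, so adding two scores near 100 in int8 gives a wrong result. A small sketch with made-up scores:

```python
import numpy as np

# Hypothetical scores stored as int8.
scores_a = np.array([100, 90], dtype=np.int8)
scores_b = np.array([100, 30], dtype=np.int8)

# int8 + int8 stays int8, so 200 wraps around to 200 - 256 = -56.
wrapped = scores_a + scores_b

# Widening one operand before the arithmetic avoids the wraparound.
safe = scores_a.astype(np.int16) + scores_b

print(wrapped.tolist())  # [-56, 120]
print(safe.tolist())     # [200, 120]
```

This matters if the int8 columns are later added together, for example to compute a total score.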

## <b>Missing Values</b>

In [None]:
#Check whether there are any missing values in the data.
df.isna().sum()

<b>- This is where we would handle missing values, either by dropping them or by replacing them with other values. </b>
<br>
<b>- However, the data used for this analysis doesn't contain any, so we skip this phase.</b>
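
For reference, here is a minimal sketch of the two options mentioned above, on a made-up frame that does contain gaps. The median and mode fills are one common choice among several, not a recommendation specific to this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps; the real dataset has none.
toy = pd.DataFrame({
    'math_score': [72.0, np.nan, 90.0, 47.0],
    'lunch': ['standard', 'standard', None, 'free/reduced'],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill numeric gaps with the median and categorical gaps with the mode.
filled = toy.copy()
filled['math_score'] = filled['math_score'].fillna(filled['math_score'].median())
filled['lunch'] = filled['lunch'].fillna(filled['lunch'].mode()[0])

print(len(dropped))
print(filled.isna().sum().sum())
```

Dropping is simplest but discards whole rows; filling keeps the rows at the cost of introducing values that were never observed.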

## <b>Descriptive Statistics</b>

In [None]:
#show summary statistics with .describe()
df.describe(include='number').T

In [None]:
#.describe() on object/categorical dtypes shows count, unique, top, and freq.
df.describe(include='category').T

## **Visualizations**

In [None]:
#set color palette
sns.set(palette='colorblind')

In [None]:
#check distribution & trend of test scores. Anything that stands out?
sns.pairplot(df)
plt.show()

<b>• There is a clear positive correlation between each pair of test scores.</b>

In [None]:
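#take a closer look
fig, ax = plt.subplots(figsize=(12,6))
#pass x and y by keyword (recent seaborn versions reject positional data vectors) and draw on the axes created above
sns.regplot(x='writing_score', y='reading_score', data=df, ax=ax)
ax.set_title('reading score vs writing score')
plt.show()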

In [None]:
#take a closer look at the score distribution
#show multiple plots
fig, axes = plt.subplots(1, 3, sharex=True, sharey=True)
axes[0].boxplot(df.math_score)
axes[0].set_title('math score')
axes[1].boxplot(df.reading_score)
axes[1].set_title('reading score')

axes[2].boxplot(df.writing_score)
axes[2].set_title('writing score')

plt.show()

**- The three score distributions look broadly the same, except that there are more outliers in the math test.**
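
To put a number on "more outliers", the points a box plot draws as outliers can be counted directly. By default matplotlib flags values more than 1.5 times the interquartile range beyond the quartiles. The helper name `count_iqr_outliers` and the sample scores below are made up for illustration.

```python
import numpy as np

def count_iqr_outliers(values, whis=1.5):
    """Count points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((values < lower) | (values > upper)).sum())

# Hypothetical scores: one very low value among otherwise typical ones.
sample = [8, 55, 60, 62, 65, 68, 70, 72, 75, 80]
n_outliers = count_iqr_outliers(sample)
print(n_outliers)
```

In this notebook the same helper could be called as `count_iqr_outliers(df.math_score)` and compared across the three score columns.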

<b>- Let's see which test score contributes the most/least to the avg of three tests.</b>

In [None]:
#Make a new column with the avg of three scores
df['average_of_three_tests'] = df[['math_score', 'reading_score', 'writing_score']].mean(axis=1).round(0).astype('int8')

In [None]:
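#check its distribution
#sns.distplot is deprecated; histplot with kde=True draws a comparable histogram plus density curve
sns.histplot(df.average_of_three_tests, bins=30, kde=True)
plt.show()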

### • Some correlation analysis

In [None]:
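#restrict to numerical columns; newer pandas raises an error when categorical columns are passed to .corr()
df.select_dtypes(include='number').corr()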

In [None]:
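#Heatmap to clearly show the correlation values between variables
#use numerical columns only and annotate each cell with its value
sns.heatmap(df.select_dtypes(include='number').corr(), cmap="YlGnBu", annot=True)
plt.show()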

<b>- Every pair of scores has a correlation of at least 0.8, which indicates a strong positive relationship between them. </b><br><br>
<b>- Math score is less strongly correlated with the avg of three tests than reading and writing scores are.</b>
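
For readers who want to see what the numbers in the correlation table mean, here is a small sketch that computes the Pearson coefficient by hand and checks it against NumPy. The five score pairs are invented for illustration and do not come from the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    # Cast to float first so narrow integer dtypes such as int8 cannot overflow.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev, y_dev = x - x.mean(), y - y.mean()
    return float((x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum()))

# Hypothetical reading and writing scores for five students.
reading = [72, 90, 95, 57, 78]
writing = [74, 88, 93, 44, 75]

r = pearson_r(reading, writing)
print(round(r, 3))
print(round(float(np.corrcoef(reading, writing)[0, 1]), 3))
```

A value near 1 means the two scores rise and fall together almost linearly; it says nothing about which one drives the other.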

Next, let's see how the categorical variables relate to the avg of three scores.

In [None]:
#Review: there are five categorical variables
df.select_dtypes(include='category').describe().T

In [None]:
def autolabel(viz):
    '''Label each bar of a bar chart with its height.'''
    for p in viz.patches:
        viz.annotate(format(p.get_height(), '.2f'), 
        (p.get_x() + p.get_width() / 2., p.get_height()), 
        ha = 'center', 
        va = 'center', 
        xytext = (0, 10), 
        textcoords = 'offset points')
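
As an alternative to the manual loop above, newer matplotlib versions (3.4 and later) provide `Axes.bar_label`. The sketch below uses plain matplotlib bars with made-up heights; whether it drops in cleanly for a seaborn bar plot depends on the seaborn version, so treat it as an option to try and not a replacement.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Hypothetical group means, not values from the dataset.
bars = ax.bar(['group A', 'group B'], [69.5, 66.25])

# bar_label annotates every bar in the container with its height.
labels = ax.bar_label(bars, fmt='%.2f', padding=3)
print([label.get_text() for label in labels])
```

With a seaborn bar plot the bars are usually reachable through `viz.containers[0]`, which could be passed in place of `bars`.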
    

In [None]:
def get_sorted_cat_values(num_column, cat_column, ascending=True):
  '''Return the categories of cat_column (as a pandas Index) sorted by the mean of num_column.
     Both arguments must be column names given as strings, e.g. "gender". '''
  #----------------------------------------------------------------
  #arguments should be str
  if not (isinstance(num_column, str) and isinstance(cat_column, str)):
    raise ValueError('Enter column by its name in string!')
  #num_column should be numerical
  num_dtype = df[num_column].dtype.name
  if not ('int' in num_dtype or 'float' in num_dtype):
    raise ValueError(f'First argument should be int or float. Your type was: {num_dtype}')
  #cat_column should be categorical
  if df[cat_column].dtype.name != 'category':
    raise ValueError(f'Second argument should be categorical. Your type was: {df[cat_column].dtype.name}')
  #----------------------------------------------------------------

  #observed=False keeps every declared category and avoids a warning in newer pandas
  sorted_values = (
                    df.groupby(cat_column, observed=False)[num_column]
                   .mean()
                   .sort_values(ascending=ascending)
                   .index
                  )
  return sorted_values
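
The core of the function above is a group-by, a mean, and a sort. Here is that chain on a toy frame, with made-up groups and scores, so the resulting order is easy to verify by eye.

```python
import pandas as pd

# Hypothetical data: three groups with two scores each.
toy = pd.DataFrame({
    'group': pd.Categorical(['a', 'b', 'a', 'c', 'b', 'c']),
    'score': [50, 80, 60, 70, 90, 65],
})

# Mean score per category, highest first; the index gives the bar order.
order = (
    toy.groupby('group', observed=False)['score']
       .mean()
       .sort_values(ascending=False)
       .index
       .tolist()
)
print(order)
```

The group means are 55 for `a`, 85 for `b`, and 67.5 for `c`, so the bars would be drawn in the order `b`, `c`, `a`.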


In [None]:
#gender
fig = plt.figure(figsize=(12, 6))
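#x and y must be keywords in recent seaborn; errorbar=None replaces the deprecated ci=None (seaborn >= 0.12)
viz = sns.barplot(x='gender', y='average_of_three_tests', data=df, order=get_sorted_cat_values('average_of_three_tests', 'gender', ascending=False), errorbar=None)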
plt.title('average_of_three_tests with gender')
autolabel(viz)
plt.show()

**- Female students have slightly higher test scores on average.**

In [None]:
#race
fig = plt.figure(figsize=(12, 6))
viz = sns.barplot(x='race_ethnicity', y='average_of_three_tests', data=df, order=get_sorted_cat_values('average_of_three_tests', 'race_ethnicity', ascending=False), errorbar=None)
plt.title('average_of_three_tests with race')
autolabel(viz)
plt.show()

<b>- There is a difference of about 10 points between group E and group A. </b>

In [None]:
#parent education level
fig = plt.figure(figsize=(12, 6))
viz = sns.barplot(x='parental_level_of_education', y='average_of_three_tests', data=df, order=get_sorted_cat_values('average_of_three_tests', 'parental_level_of_education', False), errorbar=None)
plt.title('average_of_three_tests with parental_level_of_education')
autolabel(viz)
plt.show()

<b>- It looks like the parents' education level matters: the higher the parents' education, the better the student's score. </b>

In [None]:
#lunch
fig = plt.figure(figsize=(12, 6))
viz = sns.barplot(x='lunch', y='average_of_three_tests', data=df, order=get_sorted_cat_values('average_of_three_tests', 'lunch', False), errorbar=None)
plt.title('average_of_three_tests with lunch')
autolabel(viz)
plt.show()

**- There is a difference of about 8 points between lunch types.**

In [None]:
#test prep course
fig = plt.figure(figsize=(12, 6))
viz = sns.barplot(x='test_preparation_course', y='average_of_three_tests', data=df, order=get_sorted_cat_values('average_of_three_tests', 'test_preparation_course', False), errorbar=None)
plt.title('average_of_three_tests with test prep course')
autolabel(viz)
plt.show()

**- Test scores differ noticeably depending on whether a student completed the test prep course. This could be one of the best variables for predicting test scores.**

## <b>Conclusion</b>

### • Results

1. The data contained some categorical columns and numerical columns ranging from 0 to 100, so I was able to reduce memory usage a lot.
<br><br>
2. All three tests (math, reading, writing) are strongly correlated with the average test score, but the math test was the least strongly correlated and had more extreme outliers than the other two.
<br><br>
3. Race, parents' education, and lunch are especially related to students' test scores, with a maximum difference of around 10 points in the avg test score.
<br><br>
4. The test prep course has some impact, though we don't yet know how significant that impact is.
<br><br>
5. We would still like to see how combinations of these categorical variables relate to the test scores.

### • Next steps / New assumptions to test

1. One assumption is that students of a particular race tend to be wealthier or less wealthy than students of other races. <br>
That could lead to parents having a higher or lower education level, and being able or unable to provide their children with the standard lunch. 
<br><br>
2. We would like to conduct statistical tests to see whether the differences we saw are statistically significant.
<br><br>
3. As the final step, we'd like to build a predictive model based on variables selected through statistical tests as well as other feature engineering techniques.
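
As a starting point for the statistical testing in step 2, here is a rough sketch of a two-sided permutation test for a difference in group means. This is one possible technique chosen here for illustration, not a method this notebook has settled on. The function name `permutation_test` and both score lists are made up.

```python
import numpy as np

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the group labels by shuffling the pooled values.
        shuffled = rng.permutation(pooled)
        diff = shuffled[:a.size].mean() - shuffled[a.size:].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_permutations + 1)

# Hypothetical scores for students who did and did not complete the prep course.
completed = [78, 85, 69, 90, 74, 88, 81, 77]
not_completed = [65, 72, 58, 80, 61, 70, 66, 75]

diff, p_value = permutation_test(completed, not_completed)
print(f'difference in means: {diff:.2f}, p-value: {p_value:.4f}')
```

The p-value estimates how often a difference at least this large would appear if the group labels were assigned at random. On the real data, the two lists would be replaced by the `average_of_three_tests` values of each group.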
