# Intro

In this notebook I'll apply different EDA (Exploratory Data Analysis) techniques to the [Graduate Admission 2 data](https://www.kaggle.com/mohansacharya/graduate-admissions).

The goal with this data is to predict a *student's chance of admission* to a postgraduate program, given several *predictor* variables for that student.

# Import libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import hiplot as hip
from scipy.stats import skew

# set seaborn theme
sns.set_style(style="whitegrid")

# Load data

There are two data files:
- `Admission_Predict.csv`
- `Admission_Predict_Ver1.1.csv`

I'll use the second one, since it contains more data points.

In [None]:
df = pd.read_csv("data/Admission_Predict_Ver1.1.csv")

According to the dataset author on Kaggle, the columns in this data represent:
- `GRE Score`: Score on the Graduate Record Examinations (GRE), a standardized test that is an admissions requirement for many graduate schools in the United States and Canada.
- `TOEFL Score`: Score on the TOEFL exam.
- `University Rating`: Ranking of the student's undergraduate university.
- `SOP`: Statement of Purpose strength.
- `LOR`: Letter of Recommendation strength.
- `CGPA`: Undergraduate GPA.
- `Research`: Whether the student has research experience.
- `Chance of Admit`: Admission chance.

# Getting to know the data

In this section, we'll take a quick look at the data to see how many rows there are and whether there are any missing values, so we can decide what kind of preprocessing is needed.

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.isnull().sum()

In [None]:
df.dtypes

The dataset consists of 500 samples and 9 columns: the `Serial No.` identifier, 7 *predictors*, and one *target* variable.

There are no missing values (which is a very good thing!), but some column names need to be cleaned, and the `Serial No.` column should be removed, as it has nothing to do with the student's admission chance.

Looking at the `dtypes`, all columns appear to have the correct data type: discrete columns are `int64` and continuous ones are `float64`.
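
The checks above can also be gathered into a single overview table. The following is a minimal sketch, not part of the original analysis: `summarize_columns` is a hypothetical helper, and `toy_df` is made-up data standing in for the admissions file.

```python
import numpy as np
import pandas as pd


def summarize_columns(frame):
    """Return one row per column: dtype, missing count, distinct count."""
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "n_missing": frame.isnull().sum(),
            "n_unique": frame.nunique(),
        }
    )


# Made-up toy data, used only to illustrate the helper.
toy_df = pd.DataFrame(
    {
        "score": [310, 325, 298, 310],
        "gpa": [8.1, 9.0, np.nan, 8.1],
    }
)

summary = summarize_columns(toy_df)
print(summary)
```

Note that `nunique()` ignores missing values by default, so a column's distinct count does not include `NaN`.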

# Data cleaning

As stated in the previous section, only a little *cleaning* is needed, mainly:
- remove extra whitespace from column names
- drop the `Serial No.` column

In [None]:
df.columns

Pandas has a great feature that lets us apply multiple functions to a `DataFrame` in sequence: the [pipe](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pipe.html) method.
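
As a minimal sketch of what this chaining means, `frame.pipe(f).pipe(g)` gives the same result as the nested call `g(f(frame))`, but reads in the order the steps run. The functions `add_total` and `keep_rows_above` and the toy data below are hypothetical, made up purely for illustration.

```python
import pandas as pd


def add_total(frame):
    # Return a copy with a new column; the input frame is left untouched.
    return frame.assign(total=frame["a"] + frame["b"])


def keep_rows_above(frame, threshold):
    # Extra arguments given to .pipe() are forwarded to the function.
    return frame[frame["total"] > threshold]


toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

piped = toy_df.pipe(add_total).pipe(keep_rows_above, threshold=15)
nested = keep_rows_above(add_total(toy_df), threshold=15)

print(piped)
print(piped.equals(nested))
```

Because each step returns a new `DataFrame` instead of modifying its input, the steps can be reordered or tested on their own.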

Here, I'll define two separate functions, one for each processing step, and then call them using the `pipe` method.

In [None]:
def normalize_column_names(temp_df):
    # Remove the trailing space from the two affected column names.
    return temp_df.rename(
        columns={"LOR ": "LOR", "Chance of Admit ": "Chance of Admit"}
    )
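
The mapping in `normalize_column_names` lists the two affected columns explicitly. If leading or trailing whitespace is the only problem, a more general variant is to pass `str.strip` to `rename`, which applies it to every column label. This is a sketch of an alternative, not the approach used in this notebook; `strip_column_names` is a hypothetical name, and the toy frame only mimics the padded headers.

```python
import pandas as pd


def strip_column_names(temp_df):
    # rename() accepts a function and calls it on each column label.
    return temp_df.rename(columns=str.strip)


# Toy frame mimicking the padded headers "LOR " and "Chance of Admit ".
toy_df = pd.DataFrame({"LOR ": [4.5], "Chance of Admit ": [0.92], "CGPA": [9.65]})

cleaned = strip_column_names(toy_df)
print(list(cleaned.columns))
```

The explicit mapping documents exactly which names change, while the `str.strip` variant keeps working if other padded names turn up.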

In [None]:
def drop_noisy_columns(temp_df):
    # `Serial No.` is only a row identifier, so it is dropped.
    return temp_df.drop(columns=["Serial No."])

Now, we plug them together:

In [None]:
df = df.pipe(normalize_column_names).pipe(drop_noisy_columns)

In [None]:
df.columns

In [None]:
df.shape

We *cleaned* the data with *clean* code!

# Exploratory Data Analysis (EDA)

In this section, we'll explore the data *visually* and summarize it using *descriptive statistics*.

To keep things simple, we'll divide this section into three subsections:
1. Univariate analysis: we'll focus on one variable at a time, studying its descriptive statistics, how it is distributed, and whether the distribution shows any *skewness*, using charts such as bar charts, line charts, histograms, and boxplots.
2. Bivariate analysis: we'll study the relation between *two* variables using statistics such as correlation and covariance, charts such as scatterplots, and the `hue` parameter of the earlier charts.
3. Multivariate analysis: we'll study the relation between three or more variables, using additional chart types such as the pairplot.
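
To make the statistics named in this outline concrete, here is a minimal sketch on made-up data. The column names `hours` and `score` and their values are invented for illustration and are not taken from the admissions data. It computes the skewness of one variable, then the correlation and covariance between two.

```python
import pandas as pd

# Invented toy data; "score" is an exact linear function of "hours".
toy_df = pd.DataFrame({"hours": [1, 2, 3, 4, 10]})
toy_df["score"] = 2 * toy_df["hours"] + 1

# Univariate: summary statistics and skewness of a single variable.
# A positive skewness indicates a longer right tail.
print(toy_df["hours"].describe())
hours_skew = toy_df["hours"].skew()

# Bivariate: Pearson correlation and sample covariance of two variables.
corr = toy_df["hours"].corr(toy_df["score"])
cov = toy_df["hours"].cov(toy_df["score"])

print(hours_skew, corr, cov)
```

One caveat: pandas' `.skew()` uses a bias-corrected estimator, while `scipy.stats.skew` (imported at the top of this notebook) defaults to `bias=True`, so the two can give slightly different values on the same data.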

## Univariate Analysis

## Bivariate Analysis

## Multivariate Analysis