# EDA

The aim of this exploratory data analysis (EDA) is to get a clear picture of the available data before fitting the linear regression. It tells us what types of data we are working with and whether any preprocessing is needed, such as removing duplicate rows or null values and converting categorical variables into dummy variables. Because the project focuses on prediction with linear regression, which requires numeric inputs, categorical variables cannot be used in their original form and must be encoded first.
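To make the dummy-variable step concrete, here is a minimal sketch on a hypothetical three-row frame (`toy` is illustrative data, not the project dataset). It assumes `pd.get_dummies` with `drop_first=True` is an acceptable encoding; dropping one level per category avoids perfectly collinear columns in a regression with an intercept, but the encoding actually used later in the project may differ.

```python
import pandas as pd

# Hypothetical toy data that mimics two categorical columns and one score.
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "math score": [72, 47, 90],
})

# One-hot encode the categorical columns. drop_first=True removes the first
# (alphabetically sorted) level of each category to avoid redundant columns.
encoded = pd.get_dummies(toy, columns=["gender", "lunch"], drop_first=True, dtype=int)

print(encoded)
```

Each remaining dummy column is read relative to the dropped level: `gender_male` is 1 for male students and 0 for female students.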

In [34]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from tabulate import tabulate

In [35]:
df = pd.read_csv('studentsperformance.csv')

### See the data we have so far

In [36]:
df.head(5)

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


In [37]:
df.dtypes

gender                         object
race/ethnicity                 object
parental level of education    object
lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object
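The `dtypes` output separates the text columns from the three integer score columns. As a rough sketch of how to list the non-numeric columns and count their distinct levels before encoding, the snippet below uses a hypothetical `toy` frame rather than the project data. It selects with `exclude="number"` instead of naming the `object` dtype, on the assumption that this is more robust across pandas versions.

```python
import pandas as pd

# Hypothetical toy data: two text columns and one numeric column.
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "math score": [72, 47, 90],
})

# Every column that is not numeric is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Number of distinct levels per categorical column.
levels = toy[categorical_cols].nunique()

print(categorical_cols)
print(levels)
```

Columns with only a few levels are cheap to turn into dummy variables; a column with many levels would add many regressors.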

In [38]:
df.shape

(1000, 8)

### Replace spaces with underscores

In [39]:
df.columns = df.columns.str.replace(' ', '_')
df.head(5)

| | gender | race/ethnicity | parental_level_of_education | lunch | test_preparation_course | math_score | reading_score | writing_score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |
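Note that `race/ethnicity` still contains a slash after this step, since only spaces are replaced. If fully identifier-safe names are wanted (for example, for attribute access or formula-style model APIs), one possible variant is sketched below. This is an optional alternative, not what the notebook does, and any later cell that refers to `race/ethnicity` would need the new name.

```python
import pandas as pd

# A few of the original column names, used here only for illustration.
cols = pd.Index(["race/ethnicity", "parental level of education", "math score"])

# Replace every run of non-word characters (spaces, slashes, ...) with "_".
clean = cols.str.strip().str.lower().str.replace(r"[^\w]+", "_", regex=True)

print(list(clean))
```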


### Searching for null and duplicate values

In [40]:
# Null Values
df.isna().sum()

gender                         0
race/ethnicity                 0
parental_level_of_education    0
lunch                          0
test_preparation_course        0
math_score                     0
reading_score                  0
writing_score                  0
dtype: int64

In [41]:
# Duplicate Values
df.duplicated().sum()

np.int64(0)
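Both checks return zero, so no rows need to be removed from this dataset. For reference, the sketch below shows how the two checks could be combined and what the cleanup might look like if they had found something. The `quality_report` helper and the `toy` frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Summarize row count, total null cells, and fully duplicated rows."""
    return {
        "rows": len(frame),
        "nulls": int(frame.isna().sum().sum()),
        "duplicates": int(frame.duplicated().sum()),
    }

# Hypothetical toy data with one duplicated row and one missing score.
toy = pd.DataFrame({
    "score": [1, 1, None],
    "group": ["a", "a", "b"],
})

report = quality_report(toy)

# Drop repeated rows first, then rows that contain any null value.
cleaned = toy.drop_duplicates().dropna()

print(report)
print(len(cleaned))
```

Dropping rows is the simplest option; whether to drop or impute nulls depends on how many rows are affected.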