# **Data Cleaning and Preparation for EDA**

## Objectives

* Download data from Kaggle and load into a pandas dataframe
* Clean and prepare dataset
* Engineer features for exploratory data analysis (EDA)

## Inputs

* Student academic performance dataset from Kaggle https://www.kaggle.com/datasets/sonalshinde123/student-academic-performance-dataset 

## Outputs

* Cleaned dataset output to "academic_performance_cleaned.csv". Location: data folder in student academic performance repository 



---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\tb975\\OneDrive\\Documents\\vs_code_projects\\Student-Academic-Performance\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\tb975\\OneDrive\\Documents\\vs_code_projects\\Student-Academic-Performance'
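One caveat with the cells above: running the os.chdir() cell a second time moves up another level. Below is a minimal sketch of a guard against that. It assumes the notebooks live in a folder named jupyter_notebooks, as the path output above suggests; the helper name move_to_project_root is hypothetical.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move to the parent directory only when inside the notebook folder.

    Hypothetical helper: the default folder name is an assumption based
    on the path printed by os.getcwd() in this notebook.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_folder:
        # Only step up one level when we are still in the notebook folder
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling move_to_project_root() repeatedly should leave the working directory at the project root, because the folder name no longer matches after the first call.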

# Section 1: Data loading and investigation

Start by importing the relevant Python libraries I will need for data cleaning and exploration.

In [None]:
#import libraries for data manipulation
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

I will now read the csv file located in the data folder into a pandas dataframe for manipulation.

In [8]:
#load csv into pandas dataframe
df = pd.read_csv('data/student_academic_performance_raw.csv')
#display the first 5 rows
df.head()

|  | Student_ID | Attendance (%) | Internal Test 1 (out of 40) | Internal Test 2 (out of 40) | Assignment Score (out of 10) | Daily Study Hours | Final Exam Marks (out of 100) |
|---|---|---|---|---|---|---|---|
| 0 | S1000 | 84 | 30 | 36 | 7 | 3 | 72 |
| 1 | S1001 | 91 | 24 | 38 | 6 | 3 | 56 |
| 2 | S1002 | 73 | 29 | 26 | 7 | 3 | 56 |
| 3 | S1003 | 80 | 36 | 35 | 7 | 3 | 74 |
| 4 | S1004 | 84 | 31 | 37 | 8 | 3 | 66 |


I can investigate the size of the dataset by printing the shape of the dataframe. This dataset has 2000 rows and 7 columns: one row per pupil, with a student ID and six academic indicators.

In [None]:
#print the shape of the dataframe
print(df.shape)

(2000, 7)


I can also assess the column names and datatypes of the dataset using .info(). In section 2 I will consider the need to change datatypes as part of the cleaning process.

In [10]:
#print column names and datatypes 
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 7 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   Student_ID                     2000 non-null   object
 1   Attendance (%)                 2000 non-null   int64 
 2   Internal Test 1 (out of 40)    2000 non-null   int64 
 3   Internal Test 2 (out of 40)    2000 non-null   int64 
 4   Assignment Score (out of 10)   2000 non-null   int64 
 5   Daily Study Hours              2000 non-null   int64 
 6   Final Exam Marks (out of 100)  2000 non-null   int64 
dtypes: int64(6), object(1)
memory usage: 109.5+ KB
None


From the information above it looks like there are no null values. This is confirmed below, where I count the null values in each column using .isna().sum()

In [None]:
#check for null values and sum for each column
df.isna().sum()

Student_ID                       0
Attendance (%)                   0
Internal Test 1 (out of 40)      0
Internal Test 2 (out of 40)      0
Assignment Score (out of 10)     0
Daily Study Hours                0
Final Exam Marks (out of 100)    0
dtype: int64

I can also check for duplicate rows: none are found.

In [15]:
#check for duplicate rows and sum the results
df.duplicated().sum()

0
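The check above compares whole rows, so it would miss a student ID that appears twice with different scores. As a sketch of a complementary check, duplicated() accepts a subset argument that restricts the comparison to chosen columns. The helper name count_duplicate_ids and the demo frame are made up for illustration and are not rows from the dataset.

```python
import pandas as pd


def count_duplicate_ids(frame, id_col="Student_ID"):
    """Return the number of rows whose ID repeats an earlier row's ID."""
    return int(frame.duplicated(subset=[id_col]).sum())


# Tiny made-up frame for illustration only (not rows from the dataset)
demo = pd.DataFrame({
    "Student_ID": ["S1", "S2", "S2"],
    "Attendance (%)": [84, 91, 73],
})

# Whole-row check finds nothing, the ID-only check finds one repeat
print(int(demo.duplicated().sum()))
print(count_duplicate_ids(demo))
```

On the real dataframe this would need to run before the Student_ID column is dropped.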

I can use .describe() to get summary statistics for the numerical columns. I can assess the mean and standard deviation, and check for sensible values using the max and min.

In [16]:
#get summary statistics, rounded to 2 d.p.
df.describe().round(2)

|  | Attendance (%) | Internal Test 1 (out of 40) | Internal Test 2 (out of 40) | Assignment Score (out of 10) | Daily Study Hours | Final Exam Marks (out of 100) |
|---|---|---|---|---|---|---|
| count | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 | 2000.0 |
| mean | 84.89 | 32.12 | 32.46 | 7.51 | 2.82 | 64.86 |
| std | 7.76 | 4.56 | 4.52 | 1.02 | 0.61 | 11.34 |
| min | 52.0 | 18.0 | 16.0 | 4.0 | 1.0 | 25.0 |
| 25% | 80.0 | 29.0 | 29.0 | 7.0 | 2.0 | 58.0 |
| 50% | 85.0 | 32.0 | 33.0 | 8.0 | 3.0 | 65.0 |
| 75% | 90.0 | 35.0 | 36.0 | 8.0 | 3.0 | 73.0 |
| max | 100.0 | 40.0 | 40.0 | 10.0 | 5.0 | 100.0 |


These summary statistics can be used to assess the data for sensible values. Checking the maximum and minimum values tells us the following:
- attendance has no values above 100% or below 0%
- both test scores (out of 40) have no values above 40 or below 0
- the assignment score (out of 10) has no values above 10 or below 0
- daily study hours range from 1 to 5, which seems reasonable
- final exam marks (out of 100) have no values above 100 or below 0
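The checks listed above were read off the summary table by eye. As a rough sketch, the same checks could be automated by counting values outside an expected range for each column. The score bounds come from the column names; the 0 to 24 bound for daily study hours is my assumption, and the names EXPECTED_RANGES and out_of_range_counts are hypothetical.

```python
import pandas as pd

# Inclusive (low, high) bounds. The score bounds come from the column
# names; the 0-24 bound for study hours is an assumption.
EXPECTED_RANGES = {
    "Attendance (%)": (0, 100),
    "Internal Test 1 (out of 40)": (0, 40),
    "Internal Test 2 (out of 40)": (0, 40),
    "Assignment Score (out of 10)": (0, 10),
    "Daily Study Hours": (0, 24),
    "Final Exam Marks (out of 100)": (0, 100),
}


def out_of_range_counts(frame, ranges=EXPECTED_RANGES):
    """Count values per column that fall outside the inclusive bounds.

    Columns missing from the frame are skipped. Missing values count as
    out of range because between() returns False for them.
    """
    counts = {
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
        if col in frame.columns
    }
    return pd.Series(counts, dtype="int64")
```

Calling out_of_range_counts(df) returns one count per column; a column of zeros would agree with the manual reading of the min and max rows.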

---

# Section 2: Data cleaning

Quick reminder of the data

In [None]:
#show the first 5 rows of the data
df.head()

|  | Student_ID | Attendance (%) | Internal Test 1 (out of 40) | Internal Test 2 (out of 40) | Assignment Score (out of 10) | Daily Study Hours | Final Exam Marks (out of 100) |
|---|---|---|---|---|---|---|---|
| 0 | S1000 | 84 | 30 | 36 | 7 | 3 | 72 |
| 1 | S1001 | 91 | 24 | 38 | 6 | 3 | 56 |
| 2 | S1002 | 73 | 29 | 26 | 7 | 3 | 56 |
| 3 | S1003 | 80 | 36 | 35 | 7 | 3 | 74 |
| 4 | S1004 | 84 | 31 | 37 | 8 | 3 | 66 |


An ethical consideration when using this dataset is the privacy of the pupils and the risk of them being identified in any way. I do not need Student_ID for my analysis, so I can drop this column. This means that the data is fully anonymised and no pupil can be identified in this dataset. In fact the data is synthetic, meaning that the rows do not represent real pupils. Despite this, I will remove the Student_ID column anyway to simulate what would be best practice in a real-world scenario.

In [20]:
#drop student ID column
df = df.drop('Student_ID', axis=1)
df

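|  | Attendance (%) | Internal Test 1 (out of 40) | Internal Test 2 (out of 40) | Assignment Score (out of 10) | Daily Study Hours | Final Exam Marks (out of 100) |
|---|---|---|---|---|---|---|
| 0 | 84 | 30 | 36 | 7 | 3 | 72 |
| 1 | 91 | 24 | 38 | 6 | 3 | 56 |
| 2 | 73 | 29 | 26 | 7 | 3 | 56 |
| 3 | 80 | 36 | 35 | 7 | 3 | 74 |
| 4 | 84 | 31 | 37 | 8 | 3 | 66 |
| ... | ... | ... | ... | ... | ... | ... |
| 1995 | 82 | 31 | 28 | 6 | 2 | 52 |
| 1996 | 78 | 38 | 27 | 7 | 2 | 57 |
| 1997 | 78 | 30 | 33 | 9 | 2 | 61 |
| 1998 | 82 | 29 | 40 | 8 | 3 | 59 |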


In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 6 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   Attendance (%)                 2000 non-null   int64
 1   Internal Test 1 (out of 40)    2000 non-null   int64
 2   Internal Test 2 (out of 40)    2000 non-null   int64
 3   Assignment Score (out of 10)   2000 non-null   int64
 4   Daily Study Hours              2000 non-null   int64
 5   Final Exam Marks (out of 100)  2000 non-null   int64
dtypes: int64(6)
memory usage: 93.9 KB


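Now we have 6 columns, all with a datatype of int64. We can make the dataframe more memory efficient by converting these columns to int8, which holds values from -128 to 127. The summary statistics in Section 1 show that every value lies between 1 and 100, so nothing is lost.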

In [None]:
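#downcast every int64 column to int8 to reduce memory usage
#(safe here because all values fit within the int8 range of -128 to 127)
cols = df.columns

for col in cols:
    if df[col].dtype == 'int64':
        df[col] = df[col].astype('int8')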
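The loop above relies on every value fitting in int8; astype('int8') would silently wrap any value outside -128 to 127. As an alternative sketch (not the approach used in this notebook), pd.to_numeric with downcast="integer" picks the smallest integer dtype that can hold the values in each column. The helper name downcast_integers and the demo values are made up for illustration.

```python
import pandas as pd


def downcast_integers(frame):
    """Return a copy with each integer column in its smallest safe dtype."""
    result = frame.copy()
    for col in result.columns:
        if pd.api.types.is_integer_dtype(result[col]):
            # to_numeric chooses int8, int16, ... based on the values
            result[col] = pd.to_numeric(result[col], downcast="integer")
    return result


# Made-up values for illustration only
demo = pd.DataFrame({"small": [1, 50, 100], "large": [1, 50, 1000]})
print(downcast_integers(demo).dtypes)
```

Here the "small" column should become int8, while "large" stays at int16 because 1000 does not fit in int8.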

---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the empty try block is valid Python
except Exception as e:
  print(e)
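The Outputs section names "academic_performance_cleaned.csv" in the data folder as the result of this notebook, but the save step itself is not shown. A minimal sketch of how it might look, assuming the working directory is the repository root; the helper name save_cleaned is hypothetical.

```python
import os

import pandas as pd


def save_cleaned(frame, folder="data", filename="academic_performance_cleaned.csv"):
    """Write the frame to folder/filename without the row index."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, filename)
    # index=False avoids an extra "Unnamed: 0" column when the file is read back
    frame.to_csv(path, index=False)
    return path
```

Note that a CSV file does not store datatypes, so columns converted to int8 here would be read back with pandas' default integer type unless dtypes are passed to pd.read_csv.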
