# **Data Cleaning and Preparation for EDA**

## Objectives

* Download data from Kaggle and load into a pandas dataframe
* Clean and prepare dataset
* Engineer features for exploratory data analysis (EDA)

## Inputs

* Student academic performance dataset from Kaggle https://www.kaggle.com/datasets/sonalshinde123/student-academic-performance-dataset 

## Outputs

* Cleaned dataset output to "academic_performance_cleaned.csv". Location: data folder in student academic performance repository 



---

# Change working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [39]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\tb975\\OneDrive'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [40]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [41]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\tb975'
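Because os.chdir(os.path.dirname(...)) moves up one level every time the cell is run, re-running it climbs further away from the project. A possible alternative, sketched below, is to walk upwards until a known folder is found. The helper name find_project_root is hypothetical, and using the "data" folder as the marker is an assumption based on the Outputs section.

```python
from pathlib import Path


def find_project_root(start, marker="data"):
    """Return the first directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    # Check the starting directory first, then each parent in turn
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No '{marker}' folder found at or above {start}")
```

This could be used as os.chdir(find_project_root(os.getcwd())). Since the starting directory is checked first, running the cell a second time leaves the working directory unchanged.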

# Section 1: Data loading and investigation

I start by importing the relevant Python libraries needed for data cleaning and exploration.

In [42]:
#import libraries for data manipulation
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

I will now read the csv file located in the data folder into a pandas dataframe for manipulation.

In [44]:
#load csv from the data folder into a pandas dataframe
#(relative path: assumes the working directory is the repository root)
df = pd.read_csv('data/student_academic_performance_raw.csv')
#display the first 5 rows
df.head()
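A relative path passed to pd.read_csv is resolved against the current working directory, so a quick existence check before loading can make a wrong directory easier to diagnose. The following is a minimal sketch; the helper name load_raw_csv is hypothetical and not part of pandas.

```python
from pathlib import Path

import pandas as pd


def load_raw_csv(path):
    """Read a CSV, failing early with the fully resolved path if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected the raw data at {path.resolve()}; check the working directory."
        )
    return pd.read_csv(path)
```

The error message then shows the absolute location that was searched, which is not visible from the relative filename alone.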

I can investigate the size of the dataset by printing the shape of the table. This dataset has 2000 rows and 7 columns. This represents 2000 different pupils and their academic indicators.

In [None]:
#print the shape of the dataframe
print(df.shape)

I can also assess the column names and datatypes of the dataset using .info(). In section 2 I will consider the need to change datatypes as part of the cleaning process.

In [None]:
#print column names and datatypes (.info() prints directly and returns None)
df.info()

From the information above it looks like there are no null values. I confirm this below by counting the null values in each column using .isna()

In [None]:
#check for null values and sum for each column
df.isna().sum()

I can also check for duplicate rows: none were found.

In [None]:
#count the duplicated rows
df.duplicated().sum()
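To show what these two checks report when problems do exist, here is a small sketch on toy data. The values and the column name final_exam_marks are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data: one missing value and one fully repeated row
toy = pd.DataFrame(
    {
        "Student_ID": [1, 2, 2, 3],
        "final_exam_marks": [55.0, 61.0, 61.0, None],
    }
)

null_counts = toy.isna().sum()              # missing values per column
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row

print(null_counts.to_dict())
print(n_duplicates)
```

Note that .duplicated() only flags the second and later copies of a row, so a single repeated row gives a count of 1.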

I can use .describe() to get summary statistics for the numerical columns. I can assess the mean and standard deviation, and check for sensible values using the max and min.

In [None]:
#get summary statistics, rounded to 2 d.p.
df.describe().round(2)

These summary statistics can be used to assess the data for sensible values. Checking the maximum and minimum values tells us the following:
- attendance has no values above 100% or below 0%
- both test scores (out of 40) do not have values above 40 or below 0
- the assignment score (out of 10) does not have any values above 10 or below 0
- daily study hours range from 1 to 5, which seems reasonable
- final exam marks (out of 100) have no values above 100 or below 0
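The range checks listed above could also be made explicit in code rather than read off the summary table. This is a sketch only: the column names are hypothetical, the bounds follow the maximum marks stated above, and daily study hours is left out because no fixed limit is given for it.

```python
import pandas as pd

# Hypothetical column names; inclusive (low, high) bounds from the list above
EXPECTED_RANGES = {
    "attendance": (0, 100),
    "test_1": (0, 40),
    "test_2": (0, 40),
    "assignment": (0, 10),
    "final_exam": (0, 100),
}


def out_of_range_counts(frame, ranges):
    """Count, per column, the values outside the inclusive [low, high] bounds."""
    return {
        col: int((~frame[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
    }


# Toy data with two deliberately invalid values (attendance 101, test_2 41)
toy = pd.DataFrame(
    {
        "attendance": [95, 101],
        "test_1": [40, 12],
        "test_2": [0, 41],
        "assignment": [10, 7],
        "final_exam": [88, 64],
    }
)
counts = out_of_range_counts(toy, EXPECTED_RANGES)
print(counts)
```

A dataset that passes the checks would return 0 for every column.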

---

# Section 2: Data cleaning

Quick reminder of the data

In [None]:
#show the first 5 rows of the data
df.head()

An ethical consideration when using this dataset is the privacy of the pupils, in particular whether any of them could be identified. I do not need Student_ID for my analysis, so I can drop this column. Without it the data is fully anonymised and no pupil can be identified from the dataset. In fact the data is synthetic, meaning that the rows do not represent real pupils. Despite this, I will remove the Student_ID column anyway to simulate what would be best practice in a real-world scenario.

In [None]:
#drop student ID column
df = df.drop('Student_ID', axis=1)
df
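With the default settings, drop raises a KeyError if the column is no longer present, which happens when this cell is run a second time. One possible way to make the cell safe to re-run is the errors="ignore" option, sketched here on toy data (the final_exam column name is hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({"Student_ID": [101, 102], "final_exam": [70, 85]})

# errors="ignore" makes the drop a no-op if the column has already been removed
toy = toy.drop(columns=["Student_ID"], errors="ignore")
toy = toy.drop(columns=["Student_ID"], errors="ignore")  # second run does not raise

print(list(toy.columns))
```

The trade-off is that a misspelled column name would also be ignored silently, so this suits columns whose names have already been confirmed.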

In [None]:
df.info()

Now we have 6 columns, all with a datatype of int64. In each column the largest number is 100, which fits within the int8 range of -128 to 127. Because of this we can reduce the memory usage of this dataset by changing the datatypes to int8, which stores each number in 1 byte instead of 8. This dataset is relatively small, so the saving matters little here, but with more data it could make a significant difference.

In [None]:
#save the column names in cols
cols = df.columns

#build a {column: 'int8'} mapping and change every column from int64 to int8
df = df.astype({col: 'int8' for col in cols})

#display changed datatypes
df.info()

Memory usage has now fallen from 93.9 KB to 11.8 KB
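Casting to a smaller integer type is only safe when every value fits in it; depending on the pandas version, an out-of-range value may wrap around rather than raise an error. The sketch below checks the range first and then compares the memory used by the column data. The helper name downcast_if_safe is hypothetical, and the toy frame only mimics the shape of the cleaned data (2000 rows, 6 integer columns).

```python
import numpy as np
import pandas as pd


def downcast_if_safe(frame, dtype="int8"):
    """Cast every column to `dtype`, refusing if any value falls outside its range."""
    info = np.iinfo(np.dtype(dtype))
    lowest, highest = frame.min().min(), frame.max().max()
    if lowest < info.min or highest > info.max:
        raise ValueError(f"Values span {lowest}..{highest}, outside the {dtype} range")
    return frame.astype(dtype)


# Toy stand-in: 2000 rows, 6 int64 columns, every value equal to 100
toy = pd.DataFrame(np.full((2000, 6), 100, dtype="int64"))
small = downcast_if_safe(toy)

before = toy.memory_usage(index=False).sum()   # 2000 * 6 * 8 bytes
after = small.memory_usage(index=False).sum()  # 2000 * 6 * 1 bytes
print(before, after)
```

The index is excluded here so that the figures reflect the column data only, which is why they differ slightly from the totals reported by .info().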

---


# Push files to Repo

* The cleaned dataset is saved as "academic_performance_cleaned.csv" in the data folder.

In [None]:
import os
try:
  # create the data folder if it does not already exist
  os.makedirs(name='data', exist_ok=True)
  # save the cleaned dataframe without writing the index as a column
  df.to_csv(os.path.join('data', 'academic_performance_cleaned.csv'), index=False)
except Exception as e:
  print(e)
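A CSV file stores no datatype information, so the int8 columns are inferred as int64 again when the cleaned file is read back. If the smaller datatype matters in later notebooks, it can be passed to read_csv. This is a minimal sketch using an in-memory buffer and hypothetical column names rather than the real file.

```python
import io

import pandas as pd

toy = pd.DataFrame({"attendance": [90, 75], "final_exam": [68, 54]}).astype("int8")

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading without a dtype lets pandas infer int64
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Reading with dtype="int8" restores the smaller datatype
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype="int8")

print(default_read.dtypes.to_dict())
print(typed_read.dtypes.to_dict())
```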
