# Student Dataset

In [3]:
import pandas as pd
df = pd.read_csv("bi.csv", encoding = "ISO-8859-1")

### Level 1: Tasks – Data Cleaning, Missing Data, Outliers

##### Part 1 – Data Cleaning 

| Column Name    | Description                                                                                   | Type         |
|----------------|----------------------------------------------------------------------------------------------|--------------|
| fNAME          | First name of the student.                                                                   | Categorical  |
| lNAME          | Last name of the student.                                                                    | Categorical  |
| Age            | Age of the student in years.                                                                 | Numerical    |
| gender         | Gender of the student (Male/Female).                                                         | Categorical  |
| country        | Country of origin of the student.                                                            | Categorical  |
| residence      | Current residence or type of residence (e.g., BI Residence, Private).                        | Categorical  |
| entryEXAM      | Score obtained in the entry exam (out of 100).                                               | Numerical    |
| prevEducation  | Highest level of education completed (High School, Diploma, Bachelor, Masters, Doctorate).   | Categorical  |
| studyHOURS     | Number of hours spent studying weekly.                                                       | Numerical    |
| Python         | Score achieved in the Python programming course (out of 100).                                | Numerical    |
| DB             | Score achieved in the Database (DB) course (out of 100).                                    | Numerical    |




---


In [None]:
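# Check dataset structure

shape = df.shape  # (77, 11): 77 rows, 11 columns
# By default only the last expression in a cell is displayed, so previews such as
# df.head(), df.tail() or df.sample(5) need their own cell.
df.info()
# df.describe()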

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   fNAME          77 non-null     object 
 1   lNAME          77 non-null     object 
 2   Age            77 non-null     int64  
 3   gender         77 non-null     object 
 4   country        77 non-null     object 
 5   residence      77 non-null     object 
 6   entryEXAM      77 non-null     int64  
 7   prevEducation  77 non-null     object 
 8   studyHOURS     77 non-null     int64  
 9   Python         75 non-null     float64
 10  DB             77 non-null     int64  
dtypes: float64(1), int64(4), object(6)
memory usage: 6.7+ KB


- gender contains more than two categories (only Male/Female are expected)
- prevEducation labels are inconsistent in case and spacing (e.g., "High School" vs "HighSchool")
- Python and DB should be numeric (float or int); df.info() reports float64 and int64, so both already are
- Python has 2 missing values (75 non-null out of 77)
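
The observations above can also be checked directly rather than read off `df.info()`. The following is a minimal sketch on a small toy frame whose values are hypothetical (not taken from bi.csv); it assumes only that `value_counts(dropna=False)` and `isna().sum()` are applied to the same kind of columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from bi.csv)
toy = pd.DataFrame({
    "gender": ["Female", "M", "male", "Female", "F"],
    "Python": [80.0, np.nan, 65.0, 90.0, np.nan],
})

# value_counts(dropna=False) lists every distinct label, including NaN if present
gender_counts = toy["gender"].value_counts(dropna=False)
print(gender_counts)

# isna().sum() counts the missing entries in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Counting labels makes spelling variants visible together with how often each occurs, which helps decide which spelling to keep as canonical.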

In [22]:
# Detect inconsistent categories

# No inconsistencies found in these columns
fname = df['fNAME'].unique()
lname = df['lNAME'].unique()
age = df['Age'].unique()
db = df['DB'].unique()
study = df['studyHOURS'].unique()
entryEXAM = df['entryEXAM'].unique()

# Inconsistent or missing values found in these columns
gender = df['gender'].unique()
country = df['country'].unique()
residence = df['residence'].unique()
pre = df['prevEducation'].unique()
python = df['Python'].unique()  # unique() includes nan for the missing scores

country

array(['Norway', 'Kenya', 'Uganda', 'Rsa', 'South Africa', 'Norge',
       'norway', 'Denmark', 'Netherlands', 'Italy', 'Spain', 'UK',
       'Somali', 'Nigeria', 'Germany', 'France'], dtype=object)

- gender, as expected: ['Female', 'M', 'Male', 'F', 'female', 'male']
- country: 'Rsa' duplicates 'South Africa'; 'Norge' and 'norway' duplicate 'Norway'
- prevEducation, as expected: HighSchool/High School, Barrrchelors, diploma, etc.
- residence: BI_residence ...
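
Case variants like the gender labels above can be collapsed by normalizing each label before looking it up. This is a sketch of one possible approach, not the notebook's method: the `canonical` lookup is a hypothetical helper, and the input labels are the ones listed in the gender note.

```python
import pandas as pd

# Labels as listed in the note above for the gender column
gender = pd.Series(["Female", "M", "Male", "F", "female", "male"])

# Hypothetical lookup from a normalized key to the canonical label
canonical = {"f": "Female", "female": "Female", "m": "Male", "male": "Male"}

# Strip whitespace and lower-case first so that case variants share one key
key = gender.str.strip().str.lower()
cleaned = key.map(canonical)

# map() yields NaN for keys missing from the lookup; fall back to the original label
cleaned = cleaned.fillna(gender)
print(cleaned.unique())
```

Normalizing first keeps the lookup short, since every case variant of a label shares one key. Misspellings such as 'Barrrchelors' would still need explicit entries.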


In [36]:
# Fixing the above

df['gender'] = df['gender'].replace({'F': 'Female', 'female': 'Female', 'M': 'Male', 'male': 'Male'})

df['country'] = df['country'].replace({'norway': 'Norway', 'Norge': 'Norway', 'Rsa': 'South Africa'})


df['prevEducation'] = df['prevEducation'].replace({
    'HighSchool': 'High School',
    'Barrrchelors': 'Bachelors',
    'diploma': 'Diploma',
    'DIPLOMA': 'Diploma',
    'Diplomaaa': 'Diploma' })

df['residence'] = df['residence'].replace({ 'BIResidence': 'BI Residence', 'BI_Residence': 'BI Residence', 'BI-Residence': 'BI Residence'})

df['Python'] = pd.to_numeric(df['Python'])
df['DB'] = pd.to_numeric(df['DB'])
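
With its default `errors="raise"`, `pd.to_numeric` raises a `ValueError` if a column holds text that cannot be parsed. The sketch below uses hypothetical raw scores (bi.csv already loads both columns as numeric) to show how `errors="coerce"` turns such entries into NaN instead.

```python
import pandas as pd

# Hypothetical raw scores where one entry is not a number
raw = pd.Series(["75", "88.5", "n/a", "60"])

# errors="coerce" converts unparseable entries to NaN instead of raising
scores = pd.to_numeric(raw, errors="coerce")
print(scores)
print(scores.isna().sum())
```

Coerced values become ordinary missing data, so they can be handled together with the other missing values in Part 2.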

In [None]:
# Handle duplicates: count fully duplicated rows
df.duplicated().sum()

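
The cell above only counts duplicated rows. If the count were non-zero, the repeats could be dropped as in this sketch; the toy rows are hypothetical and not taken from bi.csv.

```python
import pandas as pd

# Toy rows (hypothetical); the last row repeats the first exactly
toy = pd.DataFrame({
    "fNAME": ["Ann", "Ben", "Ann"],
    "lNAME": ["Lee", "Roy", "Lee"],
    "Age": [21, 25, 21],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row by default
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduped))
```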


---
##### Part 2 – Missing Data

In [54]:
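# Identify missing values
missing_per_column = df.isnull().sum()
num_missing_python = df['Python'].isnull().sum()  # 2 missing scores

# Impute the missing Python scores with the rounded column mean
mean = df['Python'].mean().round()
df['Python'] = df['Python'].fillna(mean)
python = df['Python'].unique()
df.info()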

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   fNAME          77 non-null     object 
 1   lNAME          77 non-null     object 
 2   Age            77 non-null     int64  
 3   gender         77 non-null     object 
 4   country        77 non-null     object 
 5   residence      77 non-null     object 
 6   entryEXAM      77 non-null     int64  
 7   prevEducation  77 non-null     object 
 8   studyHOURS     77 non-null     int64  
 9   Python         77 non-null     float64
 10  DB             77 non-null     int64  
dtypes: float64(1), int64(4), object(6)
memory usage: 6.7+ KB
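
Mean imputation, as used above, is sensitive to extreme scores. The median is a common alternative. The sketch below compares the two on hypothetical scores (not taken from bi.csv); which one suits the Python column better is left open here.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with two missing entries and one high value
scores = pd.Series([40.0, 50.0, np.nan, 60.0, 98.0, np.nan])

# mean() and median() both skip NaN by default
mean_fill = round(scores.mean())
median_fill = scores.median()

filled_mean = scores.fillna(mean_fill)
filled_median = scores.fillna(median_fill)
print(mean_fill, median_fill)
```

Here the single high score pulls the mean above the median, so the two strategies fill the gaps with different values.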
