### **Deep Learning Assignment 1: Data Preparation**

**Name:** Yesha Pandya

**Enrollment Number:** 23bt04175

**Div:** 1 **Batch:** C

**Objective:**
1. Check for the missing values
2. Check for the Null Values
3. Check for the duplicate values
4. List the number of columns
5. Scale the numeric data
6. Convert the binary data to 1 or 0
7. Convert the categorical data to vector form
8. Display the total number of columns
9. Display all the records before scaling
10. Display all the records after scaling

In [None]:
#import required libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

#load the dataset
df = pd.read_csv('College.csv')

#create a copy to display "Before Scaling" later
df_original = df.copy()

print("Setup Complete. Data Loaded.")

Setup Complete. Data Loaded.


**Data Inspection**

Handling Missing and Duplicate Data (Q1, Q2, Q3)

Before feeding data into a neural network, we must ensure it is clean.

* Missing/Null Values: Can cause errors or bias in training. We check for NaN (Not a Number).

* Duplicates: Redundant data can lead to overfitting, where the model memorizes specific examples rather than learning patterns.
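Both checks can be tried on a toy table first. The sketch below uses a small made-up DataFrame (`toy` and its values are hypothetical, not taken from College.csv) with one missing value and one repeated row, so the expected counts are easy to verify by eye.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one NaN in 'grade', and row 2 repeats row 0.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "grade": [84.0, np.nan, 84.0, 90.5],
})

missing_per_column = toy.isnull().sum()         # NaN/None count per column
has_any_null = bool(toy.isnull().values.any())  # True if any cell is missing
duplicate_rows = int(toy.duplicated().sum())    # rows identical to an earlier row

print(missing_per_column)
print(has_any_null, duplicate_rows)
```

Note that `duplicated()` only flags the later copies of a row, so a row that appears twice counts as one duplicate.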

In [None]:
# 1. check for the missing values
print("\nQ1: Missing Values per column:")
print(df.isnull().sum())

# 2. check for the Null values
# (in pandas, isnull() catches both None and NaN)
print("\nQ2: Are there any Null values in the dataset?")
print(df.isnull().values.any())

# 3. check for the duplicate values
duplicates = df.duplicated().sum()
print(f"\nQ3: Number of duplicate rows: {duplicates}")


Q1: Missing Values per column:
type_school              0
school_accreditation     0
gender                   0
interest                 0
residence                0
parent_age               0
parent_salary            0
house_area               0
average_grades           0
parent_was_in_college    0
in_college               0
dtype: int64

Q2: Are there any Null values in the dataset?
False

Q3: Number of duplicate rows: 0


**Dataset Structure**

Understanding Dimensionality (Q4)

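Knowing the number of columns (features) is essential for defining the input layer size of a Deep Learning model, because the input layer has one neuron per input feature. If we have 10 feature columns, our input layer will typically have 10 neurons. The count that matters is the number of feature columns after encoding, with the target column excluded.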
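As a rough sketch of how that count might be obtained in practice, the example below drops the target, encodes a categorical column, and reads off the remaining column count. It assumes `in_college` is the prediction target (the notebook does not state this explicitly) and uses a made-up three-row table.

```python
import pandas as pd

# Hypothetical mini-table; 'in_college' is assumed to be the target.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "interest": ["Uncertain", "Very Interested", "Not Interested"],
    "average_grades": [84.09, 86.91, 87.43],
    "in_college": [True, True, False],
})

features = toy.drop(columns=["in_college"])               # inputs only
encoded = pd.get_dummies(features, columns=["interest"])  # 1 column -> 3 columns

input_size = encoded.shape[1]  # neurons needed in the input layer
print(input_size)
```

Here two untouched feature columns plus three indicator columns give an input size of 5.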

In [None]:
# 4. list the number of columns
print(f"Q4: Number of columns: {len(df.columns)}")
print(f"Column Names: {list(df.columns)}")

Q4: Number of columns: 11
Column Names: ['type_school', 'school_accreditation', 'gender', 'interest', 'residence', 'parent_age', 'parent_salary', 'house_area', 'average_grades', 'parent_was_in_college', 'in_college']


**Feature Scaling**

Standardization (Q5)

Deep Learning models are trained with Gradient Descent optimization. If features have vastly different ranges (e.g., `parent_salary` in the millions vs. `average_grades` in the tens), the large-valued features dominate the gradient updates and the model struggles to converge.
* StandardScaler: We transform each numeric column to have a Mean ($\mu$) of 0 and Standard Deviation ($\sigma$) of 1 using the formula:

$$z = \frac{x - \mu}{\sigma}$$
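To make the formula concrete, the sketch below applies it by hand to three made-up ages. It uses NumPy's default population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows; pandas' `Series.std()` defaults to `ddof=1` and would give slightly different numbers.

```python
import numpy as np

# Hypothetical ages, chosen so the arithmetic is easy to follow.
x = np.array([50.0, 55.0, 60.0])

mu = x.mean()          # 55.0
sigma = x.std()        # population std (ddof=0), about 4.08
z = (x - mu) / sigma   # z = (x - mu) / sigma

print(z)               # roughly [-1.22, 0.00, 1.22]
print(z.mean(), z.std())
```

After the transform the column has mean 0 and standard deviation 1, which is what the scaled output of a single column should satisfy.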

In [None]:
# 5. scale the numeric data

#identify numeric columns based on the dataset structure
numeric_cols = ['parent_age', 'parent_salary', 'house_area', 'average_grades']

scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print("Q5: Numeric data scaled using StandardScaler.")
print(df[numeric_cols].head())

Q5: Numeric data scaled using StandardScaler.
   parent_age  parent_salary  house_area  average_grades
0    1.083838       1.122836    0.555074       -0.594365
1    1.369661      -0.695545    0.149467        0.240684
2   -0.631096       0.800682    0.398065        0.394664
3   -0.916918       0.872272    0.241055       -1.177715
4    1.369661      -0.094191    0.038251        0.205150


**Binary Encoding**

Encoding Binary Variables (Q6)

Machine Learning models require numerical input.

* Binary Data: Categorical data with only two options (e.g., Male/Female, Yes/No).

* Label Encoding: We map these directly to 0 and 1.
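A minimal sketch of that mapping on made-up values. `LabelEncoder` assigns codes in sorted order of the labels, so the alphabetically first label becomes 0; the fitted `classes_` attribute records which label received which code.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical column values, not read from College.csv.
gender = ["Male", "Female", "Female", "Male"]

le = LabelEncoder()
codes = le.fit_transform(gender)

print(list(le.classes_))  # position in this list = assigned code
print(list(codes))
```

Checking `classes_` after fitting is a simple way to confirm which category ended up as 1 before interpreting the encoded column.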

In [None]:
# 6. Convert the binary data to 1 or 0

#identify binary columns (Columns with only 2 unique values)
binary_cols = ['type_school', 'school_accreditation', 'gender', 'residence', 'parent_was_in_college', 'in_college']

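#keep one fitted encoder per column so each 0/1 mapping can be inspected or inverted later
encoders = {}

for col in binary_cols:
    #check to ensure column exists before processing
    if col in df.columns:
        encoders[col] = LabelEncoder()
        df[col] = encoders[col].fit_transform(df[col])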

print("Q6: Binary data converted to 0/1.")
print(df[binary_cols].head())

Q6: Binary data converted to 0/1.
   type_school  school_accreditation  gender  residence  \
0            0                     0       1          1   
1            0                     0       1          1   
2            0                     1       0          1   
3            1                     1       1          0   
4            0                     0       0          1   

   parent_was_in_college  in_college  
0                      0           1  
1                      0           1  
2                      0           1  
3                      1           1  
4                      0           0  


**Vectorization**

One-Hot Encoding (Q7)

For categorical variables with more than two categories (like interest), we cannot simply use 1, 2, 3, 4 because the model might assume a mathematical order (4 > 1) where none exists.

* Vector Form (One-Hot Encoding): We create a new binary column for each category. With the five interest categories in this dataset, a row whose interest is "Uncertain" becomes [0, 0, 0, 1, 0].
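A small sketch of the idea, using a made-up `interest` column with only three of the category names. `pd.get_dummies` creates one indicator column per category, ordered alphabetically, and each row contains exactly one 1.

```python
import pandas as pd

# Hypothetical values; only three categories to keep the output small.
toy = pd.DataFrame(
    {"interest": ["Uncertain", "Very Interested", "Not Interested", "Uncertain"]}
)

# dtype=int yields 0/1 directly instead of True/False.
one_hot = pd.get_dummies(toy, columns=["interest"], dtype=int)

print(one_hot)
```

The first row ("Uncertain") comes out as [0, 1, 0] here because "Uncertain" is the second of the three categories in alphabetical order.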

In [None]:
# 7. Convert the categorical data to vector form

categorical_cols = ['interest']

#get_dummies converts categorical variables into dummy/indicator variables (Vector form)
#dtype=int makes the new columns integers (0/1) instead of booleans (True/False)
df = pd.get_dummies(df, columns=categorical_cols, prefix=categorical_cols, dtype=int)

print("Q7: Categorical data converted to vector form (One-Hot Encoded).")
print("New columns created:", [col for col in df.columns if 'interest' in col])

Q7: Categorical data converted to vector form (One-Hot Encoded).
New columns created: ['interest_Less Interested', 'interest_Not Interested', 'interest_Quiet Interested', 'interest_Uncertain', 'interest_Very Interested']


**Final Verification**

Result Comparison (Q8, Q9, Q10)

Finally, we verify the transformation. The number of columns increases because one-hot encoding replaces the single interest column with one indicator column per category. We compare the raw data (df_original) with the processed data (df) to demonstrate the changes to the examiner.
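The expected column count can be worked out before running the cell. A rough check, using the counts reported earlier in this notebook (11 original columns in Q4, five `interest_` columns in Q7); the variable names are only illustrative:

```python
original_columns = 11   # from the Q4 output
encoded_away = 1        # 'interest' is removed by one-hot encoding
indicator_columns = 5   # one per interest category (Q7 output)

expected_total = original_columns - encoded_away + indicator_columns
print(expected_total)   # 15
```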

In [None]:
# 8. display the total number of columns
print(f"Q8: Total number of columns after processing: {len(df.columns)}")
print("-" * 50)

# 9. display all the records before scaling (Showing first 5 for readability)
print("Q9: Records BEFORE scaling (df_original):")
display(df_original.head())
print("-" * 50)

# 10. display all the records after scaling (Showing first 5 for readability)
print("Q10: Records AFTER scaling and processing (df):")
display(df.head())

Q8: Total number of columns after processing: 15
--------------------------------------------------
Q9: Records BEFORE scaling (df_original):


| | type_school | school_accreditation | gender | interest | residence | parent_age | parent_salary | house_area | average_grades | parent_was_in_college | in_college |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Academic | A | Male | Less Interested | Urban | 56 | 6950000 | 83.0 | 84.09 | False | True |
| 1 | Academic | A | Male | Less Interested | Urban | 57 | 4410000 | 76.8 | 86.91 | False | True |
| 2 | Academic | B | Female | Very Interested | Urban | 50 | 6500000 | 80.6 | 87.43 | False | True |
| 3 | Vocational | B | Male | Very Interested | Rural | 49 | 6600000 | 78.2 | 82.12 | True | True |
| 4 | Academic | A | Female | Very Interested | Urban | 57 | 5250000 | 75.1 | 86.79 | False | False |


--------------------------------------------------
Q10: Records AFTER scaling and processing (df):


| | type_school | school_accreditation | gender | residence | parent_age | parent_salary | house_area | average_grades | parent_was_in_college | in_college | interest_Less Interested | interest_Not Interested | interest_Quiet Interested | interest_Uncertain | interest_Very Interested |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 1.083838 | 1.122836 | 0.555074 | -0.594365 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 1 | 1.369661 | -0.695545 | 0.149467 | 0.240684 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | -0.631096 | 0.800682 | 0.398065 | 0.394664 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 1 | 1 | 0 | -0.916918 | 0.872272 | 0.241055 | -1.177715 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 1 | 1.369661 | -0.094191 | 0.038251 | 0.205150 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
