
**Data Processing**

Data processing means cleaning, transforming, and organizing raw data into a usable format for analysis and machine learning.

**Data Processing Techniques**

a. Handling Missing Values

b. Encoding Categorical Data

c. Feature Scaling

d. Removing Outliers

e. Splitting Data for Training and Testing

In [1]:
import pandas as pd
data=pd.read_csv('Name-Age-Gender-Marks-City.csv')
data.head()

    Name   Age Gender  Marks      City
0  Alice  18.0      F   85.0    London
1    Bob   NaN      M   90.0  New York
2  Cathy  19.0      F   95.0       NaN
3  David  17.0      M   72.0    Sydney
4    Eva  18.0      F    NaN    London


**Handling Missing Values**

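Missing values are empty entries in the data. They can either be filled in (imputation) or dropped. Here the numeric columns Age and Marks are filled with their column means, and City is filled with the placeholder 'Unknown'.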

In [2]:
data.fillna({
    'Age':data['Age'].mean(),
    'Marks':data['Marks'].mean(),
    'City':'Unknown'
},inplace=True)

print(data)

    Name   Age Gender  Marks      City
0  Alice  18.0      F   85.0    London
1    Bob  18.0      M   90.0  New York
2  Cathy  19.0      F   95.0   Unknown
3  David  17.0      M   72.0    Sydney
4    Eva  18.0      F   85.5    London
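
The cell above fills the gaps; dropping is the other option mentioned. A minimal sketch of that alternative, assuming a small hand-built DataFrame (`df`, a hypothetical stand-in for the CSV) with the same missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV used in this notebook.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Cathy', 'David', 'Eva'],
    'Age': [18.0, np.nan, 19.0, 17.0, 18.0],
    'Marks': [85.0, 90.0, 95.0, 72.0, np.nan],
    'City': ['London', 'New York', None, 'Sydney', 'London'],
})

# Count missing entries per column before deciding how to handle them.
missing_counts = df.isna().sum()
print(missing_counts)

# Drop every row that contains at least one missing value.
dropped = df.dropna()
print(dropped)
```

On a dataset this small, dropping keeps only the two complete rows, which is one reason imputation may be the better choice here.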


**Encoding Categorical Data**

Convert text categories into numbers so that algorithms expecting numeric input can use them. Here Gender is mapped to 0 (F) and 1 (M).

In [4]:
data['Gender']=data['Gender'].map({'F':0,'M':1})
print(data)

    Name   Age  Gender  Marks      City
0  Alice  18.0       0   85.0    London
1    Bob  18.0       1   90.0  New York
2  Cathy  19.0       0   95.0   Unknown
3  David  17.0       1   72.0    Sydney
4    Eva  18.0       0   85.5    London


Use one-hot encoding for the City column, which creates one True/False column per city:

In [5]:
data = pd.get_dummies(data, columns=['City'])
print(data)

    Name   Age  Gender  Marks  City_London  City_New York  City_Sydney  \
0  Alice  18.0       0   85.0         True          False        False   
1    Bob  18.0       1   90.0        False           True        False   
2  Cathy  19.0       0   95.0        False          False        False   
3  David  17.0       1   72.0        False          False         True   
4    Eva  18.0       0   85.5         True          False        False   

   City_Unknown  
0         False  
1         False  
2          True  
3         False  
4         False  
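
Some tools expect 0/1 integers rather than True/False columns. As a sketch, assuming a hand-built DataFrame (`df`, hypothetical) with the same City values, the `dtype` argument of `pd.get_dummies` can request integer output:

```python
import pandas as pd

# Hypothetical stand-in for the City column after missing values were filled.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Cathy', 'David', 'Eva'],
    'City': ['London', 'New York', 'Unknown', 'Sydney', 'London'],
})

# dtype=int produces 0/1 columns instead of booleans.
encoded = pd.get_dummies(df, columns=['City'], dtype=int)
print(encoded)
```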


**Feature Scaling**

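Feature scaling brings numeric features onto a common scale, which helps algorithms that are sensitive to feature magnitude, such as k-nearest neighbors or models trained with gradient descent. Here Min-Max scaling maps Marks to the range 0 to 1.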

In [6]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data['Marks'] = scaler.fit_transform(data[['Marks']])
print(data)

    Name   Age  Gender     Marks  City_London  City_New York  City_Sydney  \
0  Alice  18.0       0  0.565217         True          False        False   
1    Bob  18.0       1  0.782609        False           True        False   
2  Cathy  19.0       0  1.000000        False          False        False   
3  David  17.0       1  0.000000        False          False         True   
4    Eva  18.0       0  0.586957         True          False        False   

   City_Unknown  
0         False  
1         False  
2          True  
3         False  
4         False  
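
With its default settings, MinMaxScaler computes (x - min) / (max - min) for each value. A minimal sketch that reproduces the scaled Marks by hand with pandas, assuming the filled Marks values shown earlier (the `marks` Series is built inline for illustration):

```python
import pandas as pd

# Marks after the missing value was filled with the mean (85.5).
marks = pd.Series([85.0, 90.0, 95.0, 72.0, 85.5], name='Marks')

# Min-Max scaling: the smallest value maps to 0 and the largest to 1.
scaled = (marks - marks.min()) / (marks.max() - marks.min())
print(scaled.round(6))
```

The lowest mark (72) becomes 0 and the highest (95) becomes 1, matching the output of the scaler above.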


**Removing Outliers**

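Outliers are values that are much higher or lower than the rest. Here’s how to remove marks below 60 or above 100. The data is reloaded from the CSV first, so the earlier fills and encodings are not applied. Note that a comparison with a missing value is always False, so rows with missing Marks (Eva) are dropped as well: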

In [8]:
data=pd.read_csv('Name-Age-Gender-Marks-City.csv')
data = data[(data['Marks'] >= 60) & (data['Marks'] <= 100)]
print(data)

    Name   Age Gender  Marks      City
0  Alice  18.0      F   85.0    London
1    Bob   NaN      M   90.0  New York
2  Cathy  19.0      F   95.0       NaN
3  David  17.0      M   72.0    Sydney
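
The range filter above also removes rows whose Marks value is missing. A sketch of how that side effect could be made explicit, assuming a hand-built DataFrame (`df`, hypothetical) and using `Series.between` for the range check:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw Marks column, including one missing value.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Cathy', 'David', 'Eva'],
    'Marks': [85.0, 90.0, 95.0, 72.0, np.nan],
})

# between() is inclusive on both ends by default and is False for NaN.
in_range = df['Marks'].between(60, 100)

# Strict filter: rows with missing Marks are dropped along with outliers.
strict = df[in_range]

# Lenient filter: keep rows with missing Marks so they can be imputed later.
lenient = df[in_range | df['Marks'].isna()]

print(strict)
print(lenient)
```

Which variant is appropriate depends on whether missing values are handled before or after the outlier step.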


**Splitting Data for Training and Testing**

Divide the data into two sets: one for training the model and one for testing it. With test_size=0.2, 20% of the rows (here 1 of 5) go to the test set.

In [11]:
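from sklearn.model_selection import train_test_split
# Reload the raw data so all five rows are available (the outlier filter above dropped Eva)
data = pd.read_csv('Name-Age-Gender-Marks-City.csv')
# The split is random, so the rows in each set vary between runs
train, test = train_test_split(data, test_size=0.2)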
print("Train Data:\n", train)
print("Test Data:\n", test)

Train Data:
     Name   Age Gender  Marks      City
1    Bob   NaN      M   90.0  New York
0  Alice  18.0      F   85.0    London
3  David  17.0      M   72.0    Sydney
4    Eva  18.0      F    NaN    London
Test Data:
     Name   Age Gender  Marks City
2  Cathy  19.0      F   95.0  NaN
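
The split above changes on every run and keeps all columns together. A sketch of a reproducible split that separates features from a target, assuming Marks is the value to predict (an assumption made only for illustration) and using a hand-built, already cleaned DataFrame (`df`, hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical cleaned data: missing values filled and Gender encoded.
df = pd.DataFrame({
    'Age': [18.0, 18.0, 19.0, 17.0, 18.0],
    'Gender': [0, 1, 0, 1, 0],
    'Marks': [85.0, 90.0, 95.0, 72.0, 85.5],
})

X = df[['Age', 'Gender']]  # features
y = df['Marks']            # target (assumed for this example)

# random_state fixes the shuffle so the same rows are selected on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes results comparable between runs; the particular value is arbitrary.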
