### Perform various data preprocessing techniques like handling missing data and feature scaling.

#### Step 1: Start by importing the necessary Python libraries for data preprocessing.


In [31]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

#### Step 2: Load the placement dataset into a pandas DataFrame.

In [3]:
DF=pd.read_csv("data.csv",index_col=0)
DF.head()

sl_no,gender,hsc_p,hsc_s,degree_p,degree_t,etest_p,specialisation,mba_p,salary
1,M,91.0,Commerce,58.0,Sci&Tech,55.0,Mkt&HR,58.8,270000.0
2,M,78.33,Science,77.48,Sci&Tech,86.5,Mkt&Fin,66.28,200000.0
3,M,,Arts,64.0,Comm&Mgmt,75.0,Mkt&Fin,57.8,250000.0
4,M,52.0,Science,,Sci&Tech,66.0,Mkt&HR,59.43,
5,M,73.6,Commerce,73.3,Comm&Mgmt,96.8,Mkt&Fin,55.5,425000.0


#### Step 3: Take a quick look at the data to understand its structure and identify any missing values or anomalies.

In [4]:
DF.info()
DF.shape

<class 'pandas.core.frame.DataFrame'>
Index: 215 entries, 1 to 215
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   gender          215 non-null    object 
 1   hsc_p           210 non-null    float64
 2   hsc_s           215 non-null    object 
 3   degree_p        213 non-null    float64
 4   degree_t        215 non-null    object 
 5   etest_p         211 non-null    float64
 6   specialisation  215 non-null    object 
 7   mba_p           214 non-null    float64
 8   salary          148 non-null    float64
dtypes: float64(5), object(4)
memory usage: 16.8+ KB


(215, 9)

#### The method isnull() checks each element in the DataFrame (or Series) to see whether it is missing, that is, NaN (Not a Number) or None.
#### It returns a DataFrame (or Series) of the same shape as the input, with Boolean values:
#### True: The value is null (NaN or None).
#### False: The value is not null.
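
As a minimal sketch of how `isnull()` is typically used, the snippet below builds a small hypothetical frame (`toy`, with invented values rather than the placement data) and chains `isnull()` with `sum()` to count the missing entries in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "hsc_p": [91.0, np.nan, 52.0],
    "salary": [270000.0, 250000.0, None],
    "gender": ["M", "M", None],
})

# Boolean frame with the same shape as toy: True where a value is missing.
mask = toy.isnull()

# True counts as 1 when summed, so this gives the number of nulls per column.
missing_per_column = mask.sum()
print(missing_per_column)
```

The same per-column counts can be read off `DF.info()` by subtracting each non-null count from the number of rows, but the `isnull().sum()` form gives them directly.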

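#### Step 4: Handle Missing Data
#### Option 1: If the dataset is large and only a small percentage of the data is missing, you can remove the rows with missing values using dropna(). The subset argument restricts the check to the listed columns, and inplace=True modifies the DataFrame directly instead of returning a new one.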


In [5]:
DF.dropna(subset=["salary"],inplace=True)
DF.info()

<class 'pandas.core.frame.DataFrame'>
Index: 148 entries, 1 to 214
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   gender          148 non-null    object 
 1   hsc_p           146 non-null    float64
 2   hsc_s           148 non-null    object 
 3   degree_p        147 non-null    float64
 4   degree_t        148 non-null    object 
 5   etest_p         146 non-null    float64
 6   specialisation  148 non-null    object 
 7   mba_p           148 non-null    float64
 8   salary          148 non-null    float64
dtypes: float64(5), object(4)
memory usage: 11.6+ KB
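
The effect of `subset` can be seen on a small example. The sketch below uses a hypothetical four-row frame (`toy`, invented values) and also computes the share of rows that were dropped, which is a quick way to judge whether removal is acceptable. In the placement data above, the row count falls from 215 to 148, so roughly 31% of the rows are removed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data indexed by sl_no; the values are made up.
toy = pd.DataFrame(
    {
        "hsc_p": [91.0, np.nan, 52.0, 73.6],
        "salary": [270000.0, 200000.0, np.nan, 425000.0],
    },
    index=pd.Index([1, 2, 3, 4], name="sl_no"),
)

# Drop only the rows whose 'salary' is missing; NaNs in other columns are kept.
kept = toy.dropna(subset=["salary"])

# Fraction of rows removed by the drop.
dropped_fraction = 1 - len(kept) / len(toy)
print(kept)
print(dropped_fraction)
```

Row 2 survives even though its `hsc_p` is missing, because only `salary` was listed in `subset`.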


#### Option 2: If removing data isn't ideal, you can impute missing values with a statistic such as the mean, median, or most frequent value, for example df["col"] = df["col"].fillna(df["col"].mean()).

In [7]:
# Assign the result back; fillna(..., inplace=True) on a column selection is deprecated in pandas
DF['hsc_p']=DF['hsc_p'].fillna(DF['hsc_p'].mean())
DF['degree_p']=DF['degree_p'].fillna(DF['degree_p'].mean())
DF['etest_p']=DF['etest_p'].fillna(DF['etest_p'].mean())
DF.info()

<class 'pandas.core.frame.DataFrame'>
Index: 148 entries, 1 to 214
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   gender          148 non-null    object 
 1   hsc_p           148 non-null    float64
 2   hsc_s           148 non-null    object 
 3   degree_p        148 non-null    float64
 4   degree_t        148 non-null    object 
 5   etest_p         148 non-null    float64
 6   specialisation  148 non-null    object 
 7   mba_p           148 non-null    float64
 8   salary          148 non-null    float64
dtypes: float64(5), object(4)
memory usage: 11.6+ KB
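
The three imputation statistics mentioned in Option 2 can be sketched as follows. The frame `toy` and its values are hypothetical; the value 1000.0 is included only to show why the median may be the safer choice when a column contains an extreme value. Each result is assigned back to its column rather than filled in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "hsc_p": [60.0, np.nan, 80.0, 100.0],
    "etest_p": [55.0, 75.0, np.nan, 1000.0],  # 1000.0 plays the role of an outlier
    "hsc_s": ["Science", None, "Science", "Arts"],
})

# Mean imputation: the mean of 60, 80 and 100 is 80.
toy["hsc_p"] = toy["hsc_p"].fillna(toy["hsc_p"].mean())

# Median imputation: the median of 55, 75 and 1000 is 75, unaffected by the outlier.
toy["etest_p"] = toy["etest_p"].fillna(toy["etest_p"].median())

# Most-frequent imputation for a categorical column: mode() returns a Series,
# so [0] picks the first (most frequent) value.
toy["hsc_s"] = toy["hsc_s"].fillna(toy["hsc_s"].mode()[0])
print(toy)
```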

#### Step 5: Feature Scaling


<img src="https://i.postimg.cc/G21gMYnF/f.png" alt="Image Description" width="500">









#### Option 1: This method scales the data to have a mean of 0 and a standard deviation of 1.
### StandardScaler()

In [12]:
c=["hsc_p","degree_p","etest_p","salary"]
SC1=StandardScaler()
DF[c]=SC1.fit_transform(DF[c])
DF.head()

sl_no,gender,hsc_p,hsc_s,degree_p,degree_t,etest_p,...
1,M,2.265997e+00,Commerce,-1.652293,Sci&Tech,-1.328518,...
2,M,8.987875e-01,Science,1.346845,Sci&Tech,0.978332,...
3,M,1.533482e-15,Arts,-0.728534,Comm&Mgmt,0.136149,...
5,M,3.883770e-01,Commerce,0.703293,Comm&Mgmt,1.732635,...
8,M,-6.475513e-01,Science,-0.420614,Sci&Tech,-0.449718,...
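
Standardization replaces each value x with z = (x - mean) / std, computed per column. The sketch below checks this by hand against `StandardScaler` on a hypothetical single feature (`x`, invented values). It relies on the fact that scikit-learn uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with made-up values (one column, five rows).
x = np.array([[50.0], [60.0], [70.0], [80.0], [90.0]])

# Scale with scikit-learn.
scaled = StandardScaler().fit_transform(x)

# Same computation by hand: subtract the column mean, divide by the
# population standard deviation (ddof=0).
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

This also explains the value `1.533482e-15` for `sl_no` 3 in the output above: its `hsc_p` was imputed with the column mean, so after standardization it is zero up to floating-point error.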

#### Option 2: This method scales the data to a fixed range, usually between 0 and 1.
### MinMaxScaler()

In [14]:
c=["hsc_p","degree_p","etest_p","salary"]
SC1=MinMaxScaler()
DF[c]=SC1.fit_transform(DF[c])
DF.head()

sl_no,gender,hsc_p,hsc_s,degree_p,degree_t,etest_p,...
1,M,0.857051,Commerce,0.057143,Sci&Tech,0.104167,...
2,M,0.586729,Science,0.613714,Sci&Tech,0.760417,...
3,M,0.409023,Arts,0.228571,Comm&Mgmt,0.520833,...
5,M,0.485812,Commerce,0.494286,Comm&Mgmt,0.975000,...
8,M,0.280990,Science,0.285714,Sci&Tech,0.354167,...
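
Min-max scaling replaces each value x with (x - min) / (max - min), computed per column. The sketch below verifies this by hand on a hypothetical feature (`x`, invented values). It also checks a point relevant to this notebook: the MinMaxScaler cell runs on columns that were already standardized in the previous cell, and because min-max scaling is unaffected by a prior shift and rescale, the result matches what the raw values would have given.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with made-up values (one column, five rows).
x = np.array([[50.0], [60.0], [70.0], [80.0], [90.0]])

# Scale with scikit-learn and by hand.
scaled = MinMaxScaler().fit_transform(x)
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# Standardize first, then min-max scale: same values as scaling the raw data.
standardized = StandardScaler().fit_transform(x)
rescaled = MinMaxScaler().fit_transform(standardized)

print(np.allclose(scaled, manual))
print(np.allclose(rescaled, scaled))
```

Fitting each scaler on a fresh copy of the unscaled columns is still the clearer habit, since the equivalence does not hold in the other direction for every pair of transformations.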

#### Step 6: Separate the dataset into features (X) and target (y) variables. The target is usually the column you want to predict.

In [15]:
x=DF.drop('salary',axis=1)
y=DF['salary']
y.head()

sl_no
1    0.094595
2    0.000000
3    0.067568
5    0.304054
8    0.070270
Name: salary, dtype: float64
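
The `y.head()` output above shows salaries between 0 and 1 because `salary` was included in the list of scaled columns. If the target is meant to stay in its original units (an assumption about intent, not something the notebook states), one option is to split first and scale only the feature columns. A minimal sketch on a hypothetical frame (`toy`, invented rows):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "hsc_p": [91.0, 78.33, 52.0, 73.6],
    "degree_p": [58.0, 77.48, 64.0, 73.3],
    "salary": [270000.0, 200000.0, 250000.0, 425000.0],
})

# Split before scaling: X holds every column except the target.
X = toy.drop("salary", axis=1)
y = toy["salary"]  # target stays in its original units

# Scale only the numeric feature columns.
feature_cols = ["hsc_p", "degree_p"]
X[feature_cols] = MinMaxScaler().fit_transform(X[feature_cols])

print(X)
print(y)
```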


#### Step 7: After preprocessing, save the cleaned and scaled dataset to a new CSV file.


In [19]:
final=pd.concat([x,y],axis=1)
final.to_csv('pre.csv',index=False)
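
Note that `index=False` leaves the index out of the file, so the `sl_no` labels are not written to `pre.csv`. The sketch below illustrates the difference on a hypothetical two-row frame (`toy`, invented values), writing to strings instead of files so it has no side effects.

```python
import io

import pandas as pd

# Hypothetical toy data indexed by sl_no; the values are made up.
toy = pd.DataFrame(
    {"hsc_p": [0.86, 0.59], "salary": [0.09, 0.0]},
    index=pd.Index([1, 2], name="sl_no"),
)

# With no path argument, to_csv returns the CSV text as a string.
without_index = toy.to_csv(index=False)  # sl_no is omitted
with_index = toy.to_csv(index=True)      # sl_no becomes the first column

# Reading the second variant back restores sl_no as the index.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)

print(without_index.splitlines()[0])
print(with_index.splitlines()[0])
```

Whether to keep the index depends on whether `sl_no` is needed downstream; dropping it is reasonable when it is only a row counter.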

In [12]:
# Lab-1 Activities

# Perform data preprocessing for Automobile.csv

# i. Delete the column horsepower since it has a few missing values

# ii. Impute the remaining missing values with the median

# iii. Apply min-max scaling and standardization to Automobile.csv and explain which feature scaling method makes more sense for this dataset.

In [21]:
df = pd.read_csv("Automobile.csv")
df.head()

,name,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
0,chevrolet chevelle malibu,18.0,8.0,307.0,130.0,3504.0,12.0,70,usa
1,buick skylark 320,15.0,8.0,350.0,165.0,3693.0,11.5,70,usa
2,plymouth satellite,18.0,8.0,318.0,150.0,3436.0,11.0,70,usa
3,amc rebel sst,16.0,8.0,304.0,150.0,3433.0,12.0,70,usa
4,ford torino,17.0,,302.0,140.0,3449.0,10.5,70,usa


In [22]:
df.drop('horsepower', axis=1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          398 non-null    object 
 1   mpg           398 non-null    float64
 2   cylinders     395 non-null    float64
 3   displacement  395 non-null    float64
 4   weight        396 non-null    float64
 5   acceleration  395 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
dtypes: float64(5), int64(1), object(2)
memory usage: 25.0+ KB


In [23]:
# Assign the result back; fillna(..., inplace=True) on a column selection is deprecated in pandas
df['cylinders'] = df['cylinders'].fillna(df['cylinders'].median())
df['displacement'] = df['displacement'].fillna(df['displacement'].median())
df['weight'] = df['weight'].fillna(df['weight'].median())
df['acceleration'] = df['acceleration'].fillna(df['acceleration'].median())

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   name          398 non-null    object 
 1   mpg           398 non-null    float64
 2   cylinders     398 non-null    float64
 3   displacement  398 non-null    float64
 4   weight        398 non-null    float64
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64  
 7   origin        398 non-null    object 
dtypes: float64(5), int64(1), object(2)
memory usage: 25.0+ KB

In [24]:
c = ["cylinders", "displacement", "weight", "acceleration", "mpg"]
SC1 = StandardScaler()
df[c] = SC1.fit_transform(df[c])
df.head()


,name,mpg,cylinders,displacement,weight,acceleration,model_year,origin
0,chevrolet chevelle malibu,-0.706439,1.515897,1.096516,0.641001,-1.301636,70,usa
1,buick skylark 320,-1.090751,1.515897,1.510055,0.865428,-1.484357,70,usa
2,plymouth satellite,-0.706439,1.515897,1.202305,0.560255,-1.667078,70,usa
3,amc rebel sst,-0.962647,1.515897,1.067664,0.556693,-1.301636,70,usa
4,ford torino,-0.834543,-0.847774,1.04843,0.575692,-1.849799,70,usa


In [25]:
c = ["cylinders", "displacement", "weight", "acceleration", "mpg"]
SC1 = MinMaxScaler()
df[c] = SC1.fit_transform(df[c])
df.head()


,name,mpg,cylinders,displacement,weight,acceleration,model_year,origin
0,chevrolet chevelle malibu,0.239362,1.0,0.617571,0.53615,0.238095,70,usa
1,buick skylark 320,0.159574,1.0,0.728682,0.589736,0.208333,70,usa
2,plymouth satellite,0.239362,1.0,0.645995,0.51687,0.178571,70,usa
3,amc rebel sst,0.18617,1.0,0.609819,0.516019,0.238095,70,usa
4,ford torino,0.212766,0.2,0.604651,0.520556,0.14881,70,usa


In [None]:
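Standardization is preferred here. The automobile dataset contains features with very different ranges (mpg in the tens, weight in the thousands), and either method brings them onto a comparable scale. With possible outliers, however, min-max scaling lets the single smallest and largest values fix the 0-to-1 range, whereas standardization centers each feature on its mean and is generally less sensitive to such extremes.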
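
To compare the two methods side by side, it helps to keep the raw values and write each scaled version to its own copy. The following is a minimal sketch under that assumption; the frame `raw` holds only the `mpg` and `weight` values of the first four rows shown in the `df.head()` output above, and the names `standardized` and `normalized` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# mpg and weight for the first four cars listed in the df.head() output above.
raw = pd.DataFrame({
    "mpg": [18.0, 15.0, 18.0, 16.0],
    "weight": [3504.0, 3693.0, 3436.0, 3433.0],
})
cols = ["mpg", "weight"]

# Fit each scaler on the raw values and store the result in a separate copy.
standardized = raw.copy()
standardized[cols] = StandardScaler().fit_transform(raw[cols])

normalized = raw.copy()
normalized[cols] = MinMaxScaler().fit_transform(raw[cols])

# Summary statistics make the difference between the two outputs visible.
print(standardized.describe().loc[["mean", "min", "max"]])
print(normalized.describe().loc[["mean", "min", "max"]])
```

Because `raw` is never overwritten, either scaled frame can be discarded or regenerated without reloading the CSV.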