### Perform various data preprocessing techniques like handling missing data and feature scaling.

#### Step 1: Start by importing the necessary Python libraries for data preprocessing.


In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

#### Step 2: Load the placement dataset into a pandas DataFrame.

In [6]:
df=pd.read_csv("data.csv")
df.head()

Unnamed: 0,sl_no,gender,hsc_p,hsc_s,degree_p,degree_t,etest_p,specialisation,mba_p,salary
0,1,M,91.0,Commerce,58.0,Sci&Tech,55.0,Mkt&HR,58.8,270000.0
1,2,M,78.33,Science,77.48,Sci&Tech,86.5,Mkt&Fin,66.28,200000.0
2,3,M,,Arts,64.0,Comm&Mgmt,75.0,Mkt&Fin,57.8,250000.0
3,4,M,52.0,Science,,Sci&Tech,66.0,Mkt&HR,59.43,
4,5,M,73.6,Commerce,73.3,Comm&Mgmt,96.8,Mkt&Fin,55.5,425000.0


#### Step 3: Take a quick look at the data to understand its structure and identify any missing values or anomalies.

In [8]:
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sl_no           215 non-null    int64  
 1   gender          215 non-null    object 
 2   hsc_p           210 non-null    float64
 3   hsc_s           215 non-null    object 
 4   degree_p        213 non-null    float64
 5   degree_t        215 non-null    object 
 6   etest_p         211 non-null    float64
 7   specialisation  215 non-null    object 
 8   mba_p           214 non-null    float64
 9   salary          148 non-null    float64
dtypes: float64(5), int64(1), object(4)
memory usage: 16.9+ KB


(215, 10)

#### The method isnull() checks each element of the DataFrame (or Series) and flags it if it is missing (NaN or None).
It returns a DataFrame (or Series) of the same shape as the input, with Boolean values:
#### True: The value is missing (NaN or None).
#### False: The value is present.
Chaining .sum() onto the result counts the True values in each column, which gives the number of missing values per column.

In [10]:
df.isnull().sum()

sl_no              0
gender             0
hsc_p              5
hsc_s              0
degree_p           2
degree_t           0
etest_p            4
specialisation     0
mba_p              1
salary            67
dtype: int64
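
As a small illustration of what `isnull()` returns before `.sum()` collapses it, the sketch below uses a made-up three-row frame. The name `toy` and its values are assumptions for illustration only, not rows from the placement data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from data.csv
toy = pd.DataFrame({
    "hsc_p": [91.0, np.nan, 52.0],
    "salary": [270000.0, 200000.0, None],
})

mask = toy.isnull()              # Boolean frame with the same shape as toy
missing_per_column = mask.sum()  # True counts as 1, so this counts missing values
missing_share = mask.mean()      # fraction of missing values per column

print(mask)
print(missing_per_column)
print(missing_share)
```

Looking at the share of missing values, rather than only the count, can help when deciding between dropping rows and imputing.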

#### Step 4: Handle Missing Data
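#### Option 1: If the dataset is large and only a small percentage of the data is missing, you can remove the rows with missing values using dropna(subset=[...], inplace=True). Here the rows with a missing salary are dropped, which leaves 148 of the 215 rows.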


In [11]:
df.dropna(subset=["salary"], inplace=True)

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 148 entries, 0 to 213
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sl_no           148 non-null    int64  
 1   gender          148 non-null    object 
 2   hsc_p           146 non-null    float64
 3   hsc_s           148 non-null    object 
 4   degree_p        147 non-null    float64
 5   degree_t        148 non-null    object 
 6   etest_p         146 non-null    float64
 7   specialisation  148 non-null    object 
 8   mba_p           148 non-null    float64
 9   salary          148 non-null    float64
dtypes: float64(5), int64(1), object(4)
memory usage: 12.7+ KB
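
The output above reports `Index: 148 entries, 0 to 213` because `dropna` keeps the original row labels. The following sketch shows this on made-up data (`toy`, `kept`, and `renumbered` are hypothetical names) and how the rows could be renumbered if a gap-free index is wanted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from data.csv
toy = pd.DataFrame({
    "hsc_p": [91.0, np.nan, 64.0, 52.0],
    "salary": [270000.0, 200000.0, np.nan, 425000.0],
})

# Drop only the rows where salary is missing; a NaN in hsc_p alone is kept
kept = toy.dropna(subset=["salary"])
print(kept.index.tolist())        # original labels survive, with a gap

# Optionally renumber the rows 0..n-1
renumbered = kept.reset_index(drop=True)
print(renumbered.index.tolist())
```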


#### Option 2: If removing data isn't ideal, you can impute the missing values with a statistic such as the mean, median, or most frequent value, for example df[col] = df[col].fillna(df[col].mean()).

In [17]:
# Assign the filled column back instead of calling fillna(..., inplace=True) on a column selection
df["hsc_p"] = df["hsc_p"].fillna(df["hsc_p"].mean())
df["degree_p"] = df["degree_p"].fillna(df["degree_p"].mean())
df["etest_p"] = df["etest_p"].fillna(df["etest_p"].mean())
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 148 entries, 0 to 213
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sl_no           148 non-null    int64  
 1   gender          148 non-null    object 
 2   hsc_p           148 non-null    float64
 3   hsc_s           148 non-null    object 
 4   degree_p        148 non-null    float64
 5   degree_t        148 non-null    object 
 6   etest_p         148 non-null    float64
 7   specialisation  148 non-null    object 
 8   mba_p           148 non-null    float64
 9   salary          148 non-null    float64
dtypes: float64(5), int64(1), object(4)
memory usage: 12.7+ KB

Note: Calling df[col].fillna(value, inplace=True) on a selected column emits a warning in recent pandas versions, because the behavior will change in pandas 3.0: the intermediate object being modified always behaves as a copy, so the inplace call will never update df. Use df[col] = df[col].fillna(value) or df.fillna({col: value}, inplace=True) instead.
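
The option above also mentions the median and the most frequent value. A minimal sketch of both, on made-up data (the frame `toy` and its values are assumptions, not rows from the dataset), could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from data.csv
toy = pd.DataFrame({
    "etest_p": [55.0, np.nan, 75.0, 96.0],
    "hsc_s": ["Commerce", "Science", np.nan, "Science"],
})

# Median for a numeric column (less sensitive to outliers than the mean)
toy["etest_p"] = toy["etest_p"].fillna(toy["etest_p"].median())

# Most frequent value (mode) for a categorical column
toy["hsc_s"] = toy["hsc_s"].fillna(toy["hsc_s"].mode()[0])

print(toy)
```

scikit-learn's `SimpleImputer` offers the same strategies (`"mean"`, `"median"`, `"most_frequent"`) if the imputation needs to be fitted once and reused on other data.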

#### Step 5: Feature Scaling


<img src="https://i.postimg.cc/G21gMYnF/f.png" alt="Image Description" width="500">

#### Option 1: This method scales the data to have a mean of 0 and a standard deviation of 1.
### StandardScaler()

In [18]:
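# Numeric columns to scale (note that this includes the target, salary)
c=["hsc_p","degree_p","etest_p","salary"]
sc1=StandardScaler()
# Learn each column's mean and standard deviation, then overwrite the columns in df
df[c]=sc1.fit_transform(df[c])
df.head()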

Unnamed: 0,sl_no,gender,hsc_p,hsc_s,degree_p,degree_t,etest_p,specialisation,mba_p,salary
0,1,M,2.265997,Commerce,-1.652293,Sci&Tech,-1.328518,Mkt&HR,58.8,-0.200292
1,2,M,0.8987875,Science,1.346845,Sci&Tech,0.978332,Mkt&Fin,66.28,-0.951839
2,3,M,1.533482e-15,Arts,-0.728534,Comm&Mgmt,0.136149,Mkt&Fin,57.8,-0.415019
4,5,M,0.388377,Commerce,0.703293,Comm&Mgmt,1.732635,Mkt&Fin,55.5,1.463849
7,8,M,-0.6475513,Science,-0.420614,Sci&Tech,-0.449718,Mkt&Fin,62.14,-0.393547
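
Standardization computes z = (x - mean) / std for each column. The sketch below checks this by hand against `StandardScaler` on a made-up single feature (the array `x` is an assumption for illustration). Note that `StandardScaler` uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for a single feature (one column, five rows)
x = np.array([[55.0], [66.0], [75.0], [86.5], [96.8]])

# z = (x - mean) / std, using the population standard deviation (ddof=0)
manual = (x - x.mean(axis=0)) / x.std(axis=0)

scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))   # the two computations agree
print(scaled.mean(), scaled.std())   # approximately 0 and 1
```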


#### Option 2: This method scales the data to a fixed range, usually between 0 and 1.
### MinMaxScaler()

In [19]:
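# Same columns as before; df already holds their standardized values from the previous cell
d=["hsc_p","degree_p","etest_p","salary"]
sc2=MinMaxScaler()
# Rescale each column to the range [0, 1] and overwrite the columns in df
df[d]=sc2.fit_transform(df[d])
df.head()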

Unnamed: 0,sl_no,gender,hsc_p,hsc_s,degree_p,degree_t,etest_p,specialisation,mba_p,salary
0,1,M,0.857051,Commerce,0.057143,Sci&Tech,0.104167,Mkt&HR,58.8,0.094595
1,2,M,0.586729,Science,0.613714,Sci&Tech,0.760417,Mkt&Fin,66.28,0.0
2,3,M,0.409023,Arts,0.228571,Comm&Mgmt,0.520833,Mkt&Fin,57.8,0.067568
4,5,M,0.485812,Commerce,0.494286,Comm&Mgmt,0.975,Mkt&Fin,55.5,0.304054
7,8,M,0.28099,Science,0.285714,Sci&Tech,0.354167,Mkt&Fin,62.14,0.07027
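
Min-max scaling computes x' = (x - min) / (max - min) for each column. In the cell above, `MinMaxScaler` is applied to columns that were already standardized. Because min-max scaling is unaffected by a prior shift and positive rescaling, the result is the same as scaling the raw values. The sketch below checks both points on a made-up feature (the array `x` is an assumption for illustration).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical values for a single feature (one column, five rows)
x = np.array([[55.0], [66.0], [75.0], [86.5], [96.8]])

# x' = (x - min) / (max - min)
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
scaled = MinMaxScaler().fit_transform(x)

# Standardize first, then min-max scale: same result as scaling the raw values
standardized_first = MinMaxScaler().fit_transform(StandardScaler().fit_transform(x))

print(np.allclose(manual, scaled))
print(np.allclose(scaled, standardized_first))
```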


#### Step 6: Separate the dataset into features (X) and target (y) variables. The target is usually the column you want to predict.

In [21]:
x=df[["gender","hsc_p","degree_p","degree_t","etest_p","specialisation","mba_p"]]
y=df["salary"]
y.head()

0    0.094595
1    0.000000
2    0.067568
4    0.304054
7    0.070270
Name: salary, dtype: float64
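
When the feature list is long, an alternative to naming every feature column is to drop the target and keep the rest. This is only a sketch on made-up data (`toy` and `target` are hypothetical names). Unlike the cell above, it keeps every non-target column, so identifier columns such as sl_no would have to be dropped explicitly.

```python
import pandas as pd

# Hypothetical toy data, not taken from data.csv
toy = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "mba_p": [58.8, 66.28, 57.8],
    "salary": [270000.0, 200000.0, 250000.0],
})

target = "salary"
X = toy.drop(columns=[target])   # every column except the target
y = toy[target]                  # the target as a Series

print(X.columns.tolist())
print(y.name)
```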


#### Step 7: After preprocessing, save the cleaned and scaled dataset to a new CSV file.


In [22]:
final=pd.concat([x,y],axis=1)
final.to_csv("Pre.csv",index=False)
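
To illustrate why `index=False` is passed, the sketch below writes a made-up frame to an in-memory buffer instead of a file and reads it back. The names `toy`, `reloaded`, and `with_index` are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical toy data with a non-contiguous index, as left behind by dropna
toy = pd.DataFrame({"mba_p": [58.8, 66.28], "salary": [0.094595, 0.0]}, index=[0, 4])

# index=False: only the data columns are written
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())

# Default (index=True): the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
print(with_index.columns.tolist())
```

Without `index=False`, the old row labels come back as an extra column named `Unnamed: 0` when the file is read again.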

In [12]:
# Lab-1 Activities

#Perform data preprocessing for Automobile.csv

#i. Delete the column horsepower since it has a few missing values

#ii. Impute missing values with the median

#iii. Apply min-max scaling and standardization on the Automobiles.csv and explain which feature scaling method makes more sense for this dataset.