### 1. Handling Missing Values

Missing values are common in real-world datasets. They can degrade the performance of machine learning models, and many algorithms cannot accept them as input at all. Common techniques for handling them include:

#### a. Removing Rows with Missing Values:
This approach removes every row that contains at least one missing value. It is simple, but it also discards the valid values in those rows.

In [1]:
import pandas as pd

# Example data with missing values
data = pd.DataFrame({'A': [1, 2, None, 4, 5],
                     'B': [None, 10, 11, 12, 13]})

# Drop rows with missing values
data_cleaned = data.dropna()

print("Original data with missing values:")
print(data)
print("\nData after removing rows with missing values:")
print(data_cleaned)

Original data with missing values:
     A     B
0  1.0   NaN
1  2.0  10.0
2  NaN  11.0
3  4.0  12.0
4  5.0  13.0

Data after removing rows with missing values:
     A     B
1  2.0  10.0
3  4.0  12.0
4  5.0  13.0
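
`dropna` also accepts parameters that make row removal less aggressive. The sketch below is illustrative only: it reuses the toy data with an extra column `C` (an assumption, added so that `thresh` has something to count) and shows the `subset`, `thresh`, and `how` parameters.

```python
import pandas as pd

# Same toy data as above, plus a hypothetical column 'C' with no gaps
data = pd.DataFrame({'A': [1, 2, None, 4, 5],
                     'B': [None, 10, 11, 12, 13],
                     'C': [3, 5, 1, 2, 6]})

# Drop a row only if 'A' is missing; gaps in other columns are tolerated
rows_with_a = data.dropna(subset=['A'])

# Keep rows that have at least 3 non-missing values (here: fully complete rows)
rows_complete = data.dropna(thresh=3)

# Drop a row only if every value in it is missing (no such row exists here)
rows_not_empty = data.dropna(how='all')

print(rows_with_a.index.tolist())    # rows kept when only 'A' must be present
print(rows_complete.index.tolist())  # rows kept when all 3 values must be present
print(len(rows_not_empty))           # number of rows that are not entirely empty
```

Restricting the check with `subset` is often a reasonable middle ground when only some columns are required downstream.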


#### b. Removing Columns with Missing Values:
Alternatively, you can remove every column that contains at least one missing value. Passing `axis=1` tells `dropna` to operate on columns instead of rows.

In [2]:
import pandas as pd

# Example data with missing values
data = pd.DataFrame({'A': [1, 2, None, 4, 5],
                     'B': [None, 10, 11, 12, 13],
                     'C': [3, 5, 1, 2, 6]})

# Drop columns with missing values
data_cleaned = data.dropna(axis=1)

print("Original data with missing values:")
print(data)
print("\nData after removing columns with missing values:")
print(data_cleaned)

Original data with missing values:
     A     B  C
0  1.0   NaN  3
1  2.0  10.0  5
2  NaN  11.0  1
3  4.0  12.0  2
4  5.0  13.0  6

Data after removing columns with missing values:
   C
0  3
1  5
2  1
3  2
4  6
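
Dropping every column that has any gap can be too strict. A minimal sketch, assuming a hypothetical cutoff of 50% missing values per column (`max_missing` is not from the original example, and column `B` has been given extra gaps for illustration):

```python
import pandas as pd

# Toy data; 'B' is mostly missing, 'A' has a single gap (values are made up)
data = pd.DataFrame({'A': [1, 2, None, 4, 5],
                     'B': [None, None, None, 12, 13],
                     'C': [3, 5, 1, 2, 6]})

# Fraction of missing values in each column
missing_fraction = data.isna().mean()

# Hypothetical cutoff: keep only columns with at most 50% missing values
max_missing = 0.5
data_reduced = data.loc[:, missing_fraction <= max_missing]

print(missing_fraction)
print(data_reduced.columns.tolist())
```

Here `A` survives despite its single gap, which can then be handled by row removal or imputation.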


#### c. Imputation:
Imputation involves replacing missing values with a substitute value, such as the mean, median, or mode of the column.

In [3]:
import pandas as pd
from sklearn.impute import SimpleImputer

# Example data with missing values
data = pd.DataFrame({'A': [1, 2, None, 4, 5],
                     'B': [None, 10, 11, 12, 13]})

# Impute missing values using mean
imputer = SimpleImputer(strategy='mean')
data_imputed = imputer.fit_transform(data)
data_imputed = pd.DataFrame(data_imputed, columns=data.columns)

print("Original data with missing values:")
print(data)
print("\nData after imputation:")
print(data_imputed)

Original data with missing values:
     A     B
0  1.0   NaN
1  2.0  10.0
2  NaN  11.0
3  4.0  12.0
4  5.0  13.0

Data after imputation:
     A     B
0  1.0  11.5
1  2.0  10.0
2  3.0  11.0
3  4.0  12.0
4  5.0  13.0
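
The example above uses the mean. As a sketch of the median and mode alternatives mentioned earlier, the following uses pandas' `fillna` instead of `SimpleImputer`. The value 100 in column `A` and the repeated 10 in column `B` are made up so that the difference between strategies is visible.

```python
import pandas as pd

# Toy data: 'A' contains an extreme value, 'B' contains a repeated value
data = pd.DataFrame({'A': [1, 2, None, 4, 100],
                     'B': [None, 10, 10, 12, 13]})

# Median imputation: less sensitive to the extreme value 100 than the mean
data_median = data.fillna(data.median())

# Mode imputation: DataFrame.mode() can return several rows on ties,
# so take the first row as the fill value for each column
data_mode = data.fillna(data.mode().iloc[0])

print(data_median)
print(data_mode)
```

The median is usually the safer default for skewed numeric columns, while the mode is mostly useful for categorical or discrete data.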


### 2. Handling Outliers

Outliers are data points that differ significantly from the rest of the data. They can skew statistical analyses and distort machine learning models. Common techniques for handling outliers include:

#### a. Detecting Outliers:
One common method to detect outliers is using the z-score, which measures the number of standard deviations a data point is from the mean.
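
Written out, with $\bar{x}$ the sample mean and $s$ the sample standard deviation (pandas' `Series.std()` uses the sample estimate with `ddof=1` by default), the z-score of a value $x_i$ is:

```latex
z_i = \frac{x_i - \bar{x}}{s},
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n}\left(x_j - \bar{x}\right)^2}
```

A point is flagged when $|z_i|$ exceeds a chosen threshold. A threshold of 3 is a common convention, not a fixed rule.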

In [4]:
import pandas as pd
import numpy as np

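# Example data: 100 draws from a standard normal distribution
# (seeded so the result is reproducible; the sample may contain no outliers)
np.random.seed(42)
data = pd.DataFrame({'A': np.random.normal(loc=0, scale=1, size=100)})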

# Detecting outliers using z-score
z_scores = (data['A'] - data['A'].mean()) / data['A'].std()
outliers = data[np.abs(z_scores) > 3]

print("Outliers:")
print(outliers)

Outliers:
Empty DataFrame
Columns: [A]
Index: []
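
The random sample above contains no extreme values, so the result is empty. To see the rule flag something, here is a small sketch on hand-made data (the values are invented for illustration) with one obviously extreme entry:

```python
import numpy as np
import pandas as pd

# Hand-made data: 19 values between 9 and 13, plus one extreme value (50)
values = [10, 12, 11, 13, 9, 10, 11, 12, 10, 11,
          9, 13, 12, 10, 11, 10, 12, 11, 9, 50]
data = pd.DataFrame({'A': values})

# z-score of each value, using the sample standard deviation (pandas default)
z_scores = (data['A'] - data['A'].mean()) / data['A'].std()

# Flag values more than 3 standard deviations from the mean
outliers = data[np.abs(z_scores) > 3]

print("Outliers:")
print(outliers)
```

Keep in mind that an extreme value inflates the standard deviation it is measured against. With the sample standard deviation, the largest possible z-score in a sample of size n is (n - 1) / sqrt(n), so with 10 or fewer points no value can ever exceed a z-score of 3.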


#### b. Removing Outliers:
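Once outliers are flagged, you can drop them from the dataset by index. In the example below, no point exceeds the z-score threshold of 3, so all 100 rows remain after the removal step.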

In [82]:
import pandas as pd
import numpy as np

np.random.seed(42)
# Example data: 100 draws from a normal distribution with standard deviation 2
data = pd.DataFrame({'A': np.random.normal(loc=0, scale=2, size=100)})

# Detecting outliers using z-score
z_scores = (data['A'] - data['A'].mean()) / data['A'].std()
outliers = data[np.abs(z_scores) > 3]

# Removing outliers
data_cleaned = data.drop(outliers.index)

print("Original data:")
print(data.head())
print("\nData with outliers removed:")
print(data_cleaned.count())

Original data:
          A
0  0.993428
1 -0.276529
2  1.295377
3  3.046060
4 -0.468307

Data with outliers removed:
A    100
dtype: int64


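The z-score rule flags nothing here. An alternative is the interquartile range (IQR) rule, which treats any point more than 1.5 × IQR below the first quartile or above the third quartile as an outlier.

In [83]:
# Quartiles and interquartile range of each column
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1

# Fences: values outside [Lower, Upper] are treated as outliers
Lower = Q1 - 1.5 * IQR
Upper = Q3 + 1.5 * IQR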

In [84]:
Lower

A   -4.222385
dtype: float64

In [85]:
Upper

A    3.832477
dtype: float64

In [89]:
# Boolean mask: True where 'A' falls outside the IQR fences
outlier = (data['A'] > Upper['A']) | (data['A'] < Lower['A'])

In [90]:
data[outlier].count()

A    1
dtype: int64

In [91]:
data[outlier]

           A
74 -5.239490
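
The IQR steps above can be wrapped into a small helper that also removes the flagged rows. This is a sketch, not part of the original notebook: the function name `remove_iqr_outliers` and the parameter `k` are hypothetical.

```python
import numpy as np
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Return the rows of df whose `column` value lies within
    [Q1 - k*IQR, Q3 + k*IQR] (bounds inclusive)."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[df[column].between(lower, upper)]

# Same data as in the cells above
np.random.seed(42)
data = pd.DataFrame({'A': np.random.normal(loc=0, scale=2, size=100)})

data_cleaned = remove_iqr_outliers(data, 'A')

print("Rows before:", len(data))
print("Rows after: ", len(data_cleaned))
```

On this data the helper should drop the single row flagged above (index 74). A larger `k` makes the rule more permissive; 1.5 is the conventional choice, not a requirement.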
