# HomeWork3 Data Cleaning 

## 1. Handling Missing Data Questions:
### * How do you identify and handle missing values in a Pandas DataFrame?
Answer: Missing values in datasets can be represented as "NaN", "NA", "N/A" or they can be just empty cells. The following dataframe has some missing values as an example:

In [22]:
import pandas as pd
import numpy as np
# Raw string so the backslashes in the Windows path are not read as escape sequences
data_frame = pd.read_excel(r"C:\TO_PRACTICE_PYTHON\Customer Call List.xlsx")


In [23]:
data_frame

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Not_Useful_Column
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No,True
1,1002,Abed,Nadir,123/643/9775,93 West Main Street,No,Yes,False
2,1003,Walter,/White,7066950392,298 Drugs Driveway,N,,True
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y,True
4,1005,Jon,Snow,876|678|3469,123 Dragons Road,Y,No,True
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes,True
6,1007,Jeff,Winger,,1209 South Street,No,No,False
7,1008,Sherlock,Holmes,876|678|3469,98 Clue Drive,N,No,False
8,1009,Gandalf,,N/a,123 Middle Earth,Yes,,False
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No,True


In [24]:
# Determine how many rows and columns there are

data_frame.shape

(21, 8)

We can use the "isnull()" method to detect missing values:

In [25]:
data_frame.isnull()

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Not_Useful_Column
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False
6,False,False,False,True,False,False,False,False
7,False,False,False,False,False,False,False,False
8,False,False,True,False,False,False,True,False
9,False,False,False,False,False,False,False,False


#### However, the "isnull()" method returns True only for values stored as 'NaN'. Placeholder strings such as 'N/a' are not detected, which is why we can also use the "isin()" method:


In [26]:
missing_vals = ['N/a', ''] # list of strings that stand in for missing values
data_frame.isin(missing_vals) # True for every cell whose value is in the list

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Not_Useful_Column
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False
7,False,False,False,False,False,False,False,False
8,False,False,False,True,False,False,False,False
9,False,False,False,False,False,False,False,False


#### So, by handling missing values I mean filling them with whatever value you need, or simply dropping the rows that contain them:

In [27]:
df2 = data_frame.fillna(value = 0) # fillna() replaces only real 'NaN' values; the 'N/a' string is left as it is

In [28]:
df2

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Not_Useful_Column
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No,True
1,1002,Abed,Nadir,123/643/9775,93 West Main Street,No,Yes,False
2,1003,Walter,/White,7066950392,298 Drugs Driveway,N,0,True
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y,True
4,1005,Jon,Snow,876|678|3469,123 Dragons Road,Y,No,True
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes,True
6,1007,Jeff,Winger,0,1209 South Street,No,No,False
7,1008,Sherlock,Holmes,876|678|3469,98 Clue Drive,N,No,False
8,1009,Gandalf,0,N/a,123 Middle Earth,Yes,0,False
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No,True


#### Again, "fillna()" recognizes only 'NaN', so to replace the placeholder strings as well we can use the "mask()" method:

In [29]:
missing_vals = ['N/a', '', np.nan] # placeholder strings plus real NaN (np.NaN was removed in NumPy 2.0)
missing_vals_frame = data_frame.isin(missing_vals)
data_frame.mask(missing_vals_frame, 'missing')

Unnamed: 0,CustomerID,First_Name,Last_Name,Phone_Number,Address,Paying Customer,Do_Not_Contact,Not_Useful_Column
0,1001,Frodo,Baggins,123-545-5421,"123 Shire Lane, Shire",Yes,No,True
1,1002,Abed,Nadir,123/643/9775,93 West Main Street,No,Yes,False
2,1003,Walter,/White,7066950392,298 Drugs Driveway,N,missing,True
3,1004,Dwight,Schrute,123-543-2345,"980 Paper Avenue, Pennsylvania, 18503",Yes,Y,True
4,1005,Jon,Snow,876|678|3469,123 Dragons Road,Y,No,True
5,1006,Ron,Swanson,304-762-2467,768 City Parkway,Yes,Yes,True
6,1007,Jeff,Winger,missing,1209 South Street,No,No,False
7,1008,Sherlock,Holmes,876|678|3469,98 Clue Drive,N,No,False
8,1009,Gandalf,missing,missing,123 Middle Earth,Yes,missing,False
9,1010,Peter,Parker,123-545-5421,"25th Main Street, New York",Yes,No,True
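
A possible simplification is to convert every placeholder to a real NaN first, so that "isnull()", "fillna()" and "dropna()" all see the same set of missing cells. The sketch below is only an illustration: it uses a small hypothetical frame instead of the customer file, and the placeholder list is an assumption that would need to match your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample (not the customer file); placeholder strings are assumed
sample = pd.DataFrame({
    "Last_Name": ["Baggins", None, "Snow"],
    "Phone_Number": ["123-545-5421", "N/a", ""],
})

placeholders = ["N/a", "NA", "N/A", ""]

# Turn every placeholder string into a real NaN
cleaned = sample.replace(placeholders, np.nan)

# Count missing values per column
missing_counts = cleaned.isnull().sum()
print(missing_counts)

# Option 1: fill the gaps with a marker value
filled = cleaned.fillna("missing")
print(filled)

# Option 2: drop every row that has at least one missing value
dropped = cleaned.dropna()
print(dropped)
```

Whether to fill or drop depends on how much data you can afford to lose: "dropna()" here keeps only the rows that are complete in every column.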


## 2. Data Transformation Questions

#### How can you encode categorical variables in a Pandas DataFrame?

Answer: 
In Pandas, there are various methods available for encoding categorical variables, which can be chosen based on the characteristics of your data and the needs of your analysis.

In [31]:
# Label Encoding: This method assigns a unique integer to each category. 
# You can use the LabelEncoder from the sklearn.preprocessing module.

from sklearn.preprocessing import LabelEncoder

# Small sample DataFrame so the cell runs on its own (values are illustrative)
df = pd.DataFrame({'column': ['red', 'blue', 'green', 'blue']})

label_encoder = LabelEncoder()
# Categories are sorted alphabetically, then numbered starting from 0
df['encoded_column'] = label_encoder.fit_transform(df['column'])
df

###### What is one-hot encoding, and when would you use it in data preprocessing?

Answer: One-hot encoding is a technique used to convert categorical variables into a binary matrix, where each category is represented by a binary vector. In this matrix, each column corresponds to a unique category, and each row represents an observation. If a particular observation belongs to a category, the value in the corresponding column is set to 1; otherwise, it is set to 0.

One-hot encoding is typically used in data preprocessing when dealing with categorical variables in machine learning tasks. It is particularly useful when the categorical variable does not have an inherent ordinal relationship among its categories, and the algorithm being used does not interpret ordinality well.

For example, in a dataset containing a "color" feature with categories "red," "blue," and "green," one-hot encoding would create three binary columns, one for each color. Each observation would have a 1 in the column corresponding to its color and 0s in the other columns.

Using one-hot encoding ensures that the algorithm treats each category equally without assuming any ordinal relationship between them. It helps prevent the model from learning spurious relationships based on the numerical values assigned to the categories.
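
The "color" example above can be sketched with "pd.get_dummies()". The frame and its values are hypothetical, and "dtype=int" is just a choice to show 0/1 instead of True/False.

```python
import pandas as pd

# Hypothetical frame mirroring the "color" example
colors = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(colors, columns=["color"], dtype=int)
print(one_hot)
```

Each row has exactly one 1, in the column that matches its original category.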

## 3. Removing Duplicates Questions

##### How do you identify and remove duplicate rows from a DataFrame?

Answer: We can identify and remove duplicate rows from a DataFrame in Pandas using the duplicated() and drop_duplicates() methods:

In [None]:
data_frame.duplicated()

In [None]:
data_frame.drop_duplicates()

##### Can you explain the difference between the duplicated() and drop_duplicates() methods in Pandas?

Answer: As you can see above, the duplicated() method returns a boolean Series in which a row is True if it repeats an earlier row and False otherwise. The drop_duplicates() method simply returns a copy of the DataFrame with those duplicated rows removed.
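
To make the difference concrete, here is a minimal sketch on a hypothetical frame with one repeated row (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical frame: the third row repeats the second one
people = pd.DataFrame({
    "CustomerID": [1001, 1002, 1002, 1003],
    "First_Name": ["Frodo", "Abed", "Abed", "Walter"],
})

# duplicated() only flags rows; by default the first occurrence stays False
dup_mask = people.duplicated()
print(dup_mask.tolist())

# drop_duplicates() returns a new frame without the flagged rows
deduped = people.drop_duplicates()
print(deduped)

# Filtering with the inverted mask gives the same result
same_result = people[~dup_mask]
```

Both methods accept "subset" (compare only some columns) and "keep" ('first', 'last' or False) if the default behaviour is not what you need.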

## 4. Data Scaling and Normalization Questions:

##### Discuss the importance of feature scaling in machine learning.

Answer: Feature scaling is a crucial preprocessing step in machine learning, especially for algorithms that rely on distance-based calculations or gradient descent optimization. It involves transforming the features of a dataset to a similar scale, typically between 0 and 1 or with a mean of 0 and a standard deviation of 1. Here's why feature scaling is important:

Improves Convergence: Feature scaling helps algorithms converge faster during optimization. Algorithms like gradient descent converge more quickly when features are on similar scales. If features have vastly different ranges, it may take longer for the optimization algorithm to find the optimal solution.

Prevents Dominance of Features: In algorithms that use distance measures, such as k-nearest neighbors (KNN) or support vector machines (SVM), features with larger scales may dominate the calculation. Scaling ensures that each feature contributes proportionally to the distance metric.

Enhances Performance: Feature scaling can lead to better model performance and generalization. By putting features on similar scales, the model can learn more efficiently and make better predictions. It prevents the model from being biased towards features with larger magnitudes.

Ensures Stability: Scaling features can make training more numerically stable. It reduces the sensitivity of the model to the units in which the input features happen to be measured, although it does not by itself prevent overfitting or underfitting.

Facilitates Interpretability: Feature scaling can make the coefficients or importance scores of features more interpretable. When features are on a similar scale, it becomes easier to compare their relative importance in the model.

Common techniques for feature scaling include Min-Max scaling, Standardization (Z-score normalization), and Robust scaling. The choice of scaling method depends on the specific requirements of the algorithm and the characteristics of the data.

In summary, feature scaling plays a vital role in machine learning by ensuring that algorithms perform optimally, converge efficiently, and produce reliable and interpretable results. It is an essential preprocessing step in building accurate and robust machine learning models.
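
The point about distance-based algorithms can be illustrated with a small sketch. The four customers and their age and income values are hypothetical; the example only shows how the nearest neighbour of the first customer can change once both features are standardized.

```python
import numpy as np

# Hypothetical customers: [age in years, income in dollars]
X = np.array([
    [25.0, 50_000.0],   # reference customer
    [60.0, 50_500.0],   # very different age, similar income
    [27.0, 52_000.0],   # similar age, slightly different income
    [45.0, 90_000.0],
])

def distances_from_first(points):
    """Euclidean distance from the first row to every other row."""
    return np.linalg.norm(points[1:] - points[0], axis=0 + 1)

# On raw values, income dominates because its numbers are much larger
raw = distances_from_first(X)
print("raw distances:   ", np.round(raw, 1))

# Standardize each column to mean 0 and standard deviation 1
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = distances_from_first(X_scaled)
print("scaled distances:", np.round(scaled, 3))

print("nearest (raw):   row", np.argmin(raw) + 1)
print("nearest (scaled): row", np.argmin(scaled) + 1)
```

On the raw values the 60-year-old looks closest because only income matters in practice; after scaling, the 27-year-old becomes the nearest neighbour.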

##### Explain the difference between min-max scaling and z-score normalization.

Min-Max Scaling:

Min-Max scaling, also known as normalization, rescales features to a fixed range, typically between 0 and 1.
It works by subtracting the minimum value of the feature and then dividing by the difference between the maximum and minimum values.

The formula for Min-Max scaling is: 
$$ \begin{gather*}
x_{scaled}=(x-x_{min})/(x_{max}-x_{min})
\end{gather*} $$

This method preserves the original distribution of the data and is useful when the features have a known minimum and maximum value.
Min-Max scaling is sensitive to outliers, as they can disproportionately affect the range of the scaled values.

In [None]:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Sample DataFrame
data = pd.DataFrame({
    'Feature1': [10, 20, 30, 40, 50],
    'Feature2': [1, 2, 3, 4, 5]
})

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Fit the scaler to the data and transform it
scaled_data = scaler.fit_transform(data)

# Convert the scaled data back to a DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=data.columns)

print("Original Data:")
print(data)
print("\nScaled Data using Min-Max Scaling:")
print(scaled_df)

Z-score Normalization (Standardization):

Z-score normalization standardizes features by transforming them to have a mean of 0 and a standard deviation of 1.
It works by subtracting the mean of the feature and then dividing by the standard deviation.
The formula for Z-score normalization is:

$$ \begin{gather*}
x_{scaled} = (x - \mu)/\sigma
\end{gather*}
$$

where $\mu$ is the mean of the feature and $\sigma$ is its standard deviation.

This method centers the data around 0 and ensures that the scaled values have a standard deviation of 1.
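Z-score normalization is somewhat less sensitive to outliers than Min-Max scaling, because a single extreme value does not fix the whole output range the way the minimum or maximum does. The mean and standard deviation are still pulled by extreme values, however, so it is not a robust method.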

In [None]:
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Sample DataFrame
data = pd.DataFrame({
    'Feature1': [10, 20, 30, 40, 50],
    'Feature2': [1, 2, 3, 4, 5]
})

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the data and transform it
scaled_data = scaler.fit_transform(data)

# Convert the scaled data back to a DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=data.columns)

print("Original Data:")
print(data)
print("\nScaled Data using Z-score Normalization:")
print(scaled_df)
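
How the two methods react to an outlier can be checked with a short sketch. The values below are hypothetical, and "RobustScaler" (median and IQR based) is included only as a point of comparison for the robust scaling mentioned earlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical feature: four typical values and one extreme value
values = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])

results = {}
for scaler in (MinMaxScaler(), StandardScaler(), RobustScaler()):
    # fit_transform returns a 2-D array; flatten it for printing
    scaled = scaler.fit_transform(values).ravel()
    results[type(scaler).__name__] = scaled
    print(f"{type(scaler).__name__:>15}: {np.round(scaled, 3)}")
```

With Min-Max scaling the four typical values are squeezed into a tiny interval near 0 because the outlier defines the maximum. Standardization also compresses them, while the robust scaler keeps them spread out and pushes only the outlier far away.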

## 5. Handling Outliers Questions

#### What are outliers, and why might they impact machine learning models?

Outliers are data points that deviate significantly from the rest of the data in a dataset. They can be unusually high or low values compared to the majority of the data points. Outliers can occur due to measurement errors, experimental errors, or natural variations in the data.

Outliers might impact machine learning models in several ways:

Skewing Statistical Measures: Outliers can significantly affect statistical measures such as the mean and standard deviation. The mean is sensitive to outliers, and even a single outlier can substantially shift its value, leading to biased estimates of central tendency and dispersion.

Distorting Relationships: Outliers can distort the relationships and patterns present in the data. Machine learning algorithms often rely on capturing underlying patterns and relationships between features and the target variable. Outliers may introduce noise or artificial patterns, leading the model to learn incorrect relationships.

Influencing Model Performance: Outliers can influence the performance of machine learning models, particularly those sensitive to the scale and distribution of data. Algorithms like linear regression and k-nearest neighbors (KNN) are highly sensitive to outliers as they can disproportionately affect the model's predictions.

Decreasing Robustness: Outliers can reduce the robustness and generalizability of machine learning models. A model trained on data with outliers may perform well on the training set but poorly on unseen data. It may fail to generalize to new observations or real-world scenarios where outliers are common.

Impact on Distance-based Algorithms: Algorithms that rely on distance metrics, such as KNN or clustering algorithms, can be significantly impacted by outliers. Outliers may distort the calculation of distances, leading to erroneous clustering or classification decisions.

Increased Model Complexity: Outliers may lead to the overfitting of machine learning models. Models may learn to fit the outliers, resulting in overly complex models that perform poorly on new data.

To mitigate the impact of outliers on machine learning models, it's essential to perform outlier detection and treatment during data preprocessing. This may involve techniques such as removing outliers, transforming features, or using robust algorithms that are less sensitive to outliers. Additionally, domain knowledge and context are crucial for identifying whether outliers are genuine data points or erroneous measurements that should be discarded.
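
The effect on statistical measures is easy to see in a small sketch. The measurements are hypothetical; the only difference between the two arrays is one added extreme value.

```python
import numpy as np

# Hypothetical measurements; the second array adds a single extreme value
clean = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
with_outlier = np.append(clean, 100.0)

print("mean:  ", round(clean.mean(), 2), "->", round(with_outlier.mean(), 2))
print("std:   ", round(clean.std(), 2), "->", round(with_outlier.std(), 2))
print("median:", np.median(clean), "->", np.median(with_outlier))
```

The mean and standard deviation jump noticeably, while the median stays where it was, which is why median-based statistics are usually preferred when outliers are suspected.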

### Describe different methods for detecting outliers in a dataset in Python


In Python, several methods can be used to detect outliers in a dataset. Here are some commonly used techniques:

Visual Inspection: Visualizing the data with scatter plots, box plots, histograms, or QQ plots can often reveal outliers. They appear as points that are distant from the bulk of the data, or as points that lie outside the whiskers of a box plot.

Descriptive Statistics: Calculating descriptive statistics such as the mean, median, standard deviation, and quartiles can help identify outliers. Data points that lie more than a chosen number of standard deviations from the mean, or far outside the quartiles, may be considered outliers.

Z-Score Method: Calculate the Z-score for each data point, which represents how many standard deviations away it is from the mean. Data points whose absolute Z-score exceeds a chosen threshold (e.g., 3) may be considered outliers.

Interquartile Range (IQR) Method: Calculate the IQR by subtracting the first quartile (Q1) from the third quartile (Q3). Define the outlier bounds, typically Q1 - 1.5 * IQR as the lower bound and Q3 + 1.5 * IQR as the upper bound. Data points lying outside these bounds are considered outliers.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN is an unsupervised clustering algorithm that identifies outliers as points that lie in low-density regions. Points that do not belong to any cluster are labeled as noise and can be treated as outliers.

Isolation Forest: Isolation Forest is an ensemble-based anomaly detection algorithm that isolates outliers by randomly selecting a feature and a split value, and splitting the data points along them. Outliers are identified as data points that require fewer splits to be isolated from the rest of the data.

Local Outlier Factor (LOF): LOF is an algorithm that calculates the local density of a data point relative to its neighbors. Points with significantly lower density compared to their neighbors are considered outliers.

In [None]:
import numpy as np

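# Generate some random data (seeded for reproducibility) and plant two extreme values,
# since 100 draws from a standard normal rarely contain a point beyond 3 standard deviations
rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=0, scale=1, size=100), [8.0, -9.0])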

# Calculate Z-scores
z_scores = (data - np.mean(data)) / np.std(data)

# Define threshold for outliers (e.g., ±3)
threshold = 3

# Identify outliers
outliers = np.abs(z_scores) > threshold

print("Outliers:")
print(data[outliers])
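
The IQR method from the list above can be sketched in the same way. The sample values are hypothetical, with two extreme values planted on purpose, and 1.5 is the usual multiplier rather than a fixed rule.

```python
import pandas as pd

# Hypothetical sample with two planted extreme values (95 and -40)
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 20, 95, -40])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Bounds beyond which a point counts as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

iqr_outliers = values[(values < lower) | (values > upper)]
print("Bounds:", lower, upper)
print("Outliers:", iqr_outliers.tolist())
```

Unlike the Z-score method, the bounds here are based on quartiles, so the extreme values themselves have little influence on where the thresholds end up.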

##### How can you handle outliers in a continuous numerical variable in Python?


Handling outliers in a continuous numerical variable in Python involves various strategies aimed at either mitigating their impact on the analysis or correcting them. Here are several approaches commonly used:

Removing Outliers: Identify outliers using one of the outlier detection methods mentioned earlier, then remove the rows that contain them. Use caution when removing outliers, as it may lead to loss of information and bias in the analysis. If dropping rows is too costly, the outlying value can be replaced instead (see Imputation below).

Transforming Variables: Apply transformations to the variable that make the distribution more symmetric or closer to normal, which can reduce the impact of outliers. Common transformations include taking the logarithm, square root, or reciprocal of the variable. Transformation methods such as Box-Cox or Yeo-Johnson can also be used.

Winsorization: Winsorization involves capping the extreme values of the variable at chosen percentiles (e.g., the 5th and 95th, or the 1st and 99th). Values above the upper cap are replaced with the value at the upper percentile, while values below the lower cap are replaced with the value at the lower percentile.

Binning or Discretization: Grouping continuous numerical values into bins or discrete categories can help reduce the impact of outliers, because an extreme value simply falls into the first or last bin. This approach can be useful when outliers occur infrequently and have a limited impact on the overall distribution.

Using Robust Algorithms: Robust statistical methods and algorithms are less sensitive to outliers and can be used as an alternative to traditional methods. Examples include robust regression techniques like RANSAC (RANdom SAmple Consensus) and clustering algorithms like DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which labels isolated points as noise.

Model-based Approaches: Train machine learning models that are robust to outliers or less affected by their presence. Tree-based ensemble methods like Random Forest and Gradient Boosting are known to be relatively robust to outliers in the input features.

Imputation: Replace outliers with more plausible values based on domain knowledge or statistical methods. For example, outliers can be replaced with the median or mean of the variable, or imputed using predictive modeling techniques.

Treating Outliers as a Separate Category: Sometimes outliers may represent genuinely different observations or special cases. In that situation, treat outliers as a separate category or class in the analysis, rather than removing or modifying them.
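
Three of these approaches (removal, winsorization and a log transform) can be sketched with plain Pandas and NumPy. The "income" values are hypothetical, and the 5th/95th percentiles and the 1.5 multiplier are assumed choices, not fixed rules.

```python
import numpy as np
import pandas as pd

# Hypothetical, strictly positive variable: 30..49 plus one extreme value
income = pd.Series(list(range(30, 50)) + [500], dtype=float)

# Removal: keep only the values inside the IQR bounds
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
kept = income[income.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Winsorization: cap values at the 5th and 95th percentiles
low, high = income.quantile([0.05, 0.95])
capped = income.clip(lower=low, upper=high)

# Log transform: compresses large values (requires positive data)
logged = np.log(income)

print("rows kept after removal:", len(kept), "of", len(income))
print("range after capping:   ", capped.min(), "to", capped.max())
print("max after log:         ", round(logged.max(), 2))
```

Removal shrinks the dataset, capping keeps every row but changes the extreme values, and the log transform keeps the ordering of all points while reducing how far the extreme one sits from the rest.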