# Data Wrangling

by Warda Gull

> The process of cleaning, transforming, and organizing data.

It helps to make data consistent and useful.

## Steps
1. Gathering data
2. Assessing the data (EDA)
3. Tools to clean data (libraries)
4. How to clean
    - Dealing with missing values
    - Correcting errors in data
        - Removing outliers
          - Visualization 
          - IQR method 
          - Z-Score
    - Finding and Dropping Duplicates 
5. Transforming the data
    - Normalizing the data (Important)
        - Min-Max Scaling
        - Standard scaler / Z-score normalization
        - Decimal Scaling
        - Winsorization
        - Log Transformation
6. Organizing Data
   - Column Creation
   - Renaming 
7. Saving the processed data for future use
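
Decimal scaling and winsorization appear in the list above but are not demonstrated later in this notebook. The following is a minimal sketch on made-up values (the `values` series and the 5th/95th percentile cut-offs are assumptions chosen for illustration, not part of the Titanic workflow).

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical), with one extreme observation at the end
values = pd.Series([12.0, 15.0, 18.0, 22.0, 25.0, 30.0, 35.0, 40.0, 45.0, 950.0])

# Decimal scaling: divide by 10**j, where j is the smallest integer
# that brings every absolute value below 1
max_abs = values.abs().max()
j = int(np.floor(np.log10(max_abs))) + 1
decimal_scaled = values / (10 ** j)

# Winsorization: cap values at chosen percentiles instead of dropping rows
lower, upper = values.quantile([0.05, 0.95])
winsorized = values.clip(lower=lower, upper=upper)

print(j)                    # 3
print(decimal_scaled.max()) # 0.95
print(len(winsorized))      # 10, no rows are removed
```

Winsorization keeps the row count unchanged, which can matter when rows are scarce; the IQR and Z-score filters used later remove rows instead.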

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df=sns.load_dataset('titanic')


In [None]:
df.isnull().sum() * 100 / len(df)

### Missing values

In [None]:
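# Dealing with missing values

# 'deck' is mostly missing, so drop the column
df.drop(columns='deck', inplace=True)

# Fill each column with its own statistic. Calling df.fillna(value) on the
# whole frame would write that value into every column that has gaps.
df['age'] = df['age'].fillna(df['age'].mean())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])

# Percentage of missing values per column after imputation
df.isnull().sum() * 100 / len(df)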
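
To see why the fill value should be chosen per column, here is a small sketch on a hypothetical two-column frame (`toy` and its numbers are invented for illustration). A scalar passed to `DataFrame.fillna` is applied to every column, while a dict maps each column to its own value.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap in each column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 30.0],
    "fare": [7.25, np.nan, 8.05],
})

# Scalar fill: the mean age (26.0) also lands in the 'fare' column
blanket = toy.fillna(toy["age"].mean())

# Dict fill: each column receives its own statistic
per_column = toy.fillna({
    "age": toy["age"].mean(),
    "fare": toy["fare"].median(),
})

print(blanket.loc[1, "fare"])     # 26.0, an age written into fare
print(per_column.loc[1, "fare"])  # median of the observed fares
```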



In [None]:
df.dtypes

In [None]:
sns.histplot(df['age'])

In [None]:
# Log transformation
# Compare the histograms before and after the transformation; it does not work well for age
df['age'] = np.log(df['age'])

In [None]:
sns.histplot(df['age'])

In [None]:
sns.histplot(df['fare'])

In [None]:
# fare contains zeros, and log(0) is -inf, so use log1p, i.e. log(1 + x)
df['fare'] = np.log1p(df['fare'])
sns.histplot(df['fare'])

### Removing Outliers

In [None]:
# Visualization

sns.boxplot( y='age' , data=df)



In [None]:
sns.boxplot( y='fare' , data=df)


In [None]:
# removing Outliers through IQR

Q1=df['age'].quantile(0.25)
Q3=df['age'].quantile(0.75)
IQR=Q3-Q1
IQR

In [None]:
upper_bound= Q3 +1.5*IQR
lower_bound= Q1 -1.5*IQR

print(upper_bound,lower_bound)

In [None]:
df=df[(df['age']< upper_bound) & (df['age']> lower_bound)]
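
The IQR steps above are repeated for each column, so they can be wrapped in a small helper. This is only a sketch: `iqr_bounds` is a hypothetical name, and the `ages` values are made up so the arithmetic is easy to follow.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up ages with one very low and one very high value
ages = pd.Series([2, 21, 22, 24, 25, 26, 28, 30, 31, 80])

low, high = iqr_bounds(ages)   # Q1 = 22.5, Q3 = 29.5, IQR = 7.0
kept = ages[(ages > low) & (ages < high)]

print(low, high)      # 12.0 40.0
print(kept.tolist())  # [21, 22, 24, 25, 26, 28, 30, 31]
```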


In [None]:
sns.boxplot( y='age' , data=df)


In [None]:
# removing outliers in fare
Q1=df['fare'].quantile(0.25)
Q3=df['fare'].quantile(0.75)

IQR=Q3-Q1

upper_bound= Q3 +1.5*IQR
lower_bound= Q1 -1.5*IQR

df=df[(df['fare']< upper_bound) & (df['fare']> lower_bound)]



In [None]:
sns.boxplot( y='fare' , data=df)


In [None]:
sns.histplot(df['fare'])

In [None]:
sns.histplot(df['age'])

In [None]:
# Z-score method
from scipy import stats
import numpy as np

z = np.abs(stats.zscore(df['age']))
threshold = 3

df = df[z < threshold]
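
For reference, the Z-score filter can be written out by hand with pandas alone. The sketch below uses invented `fares` values and the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Made-up fares: eleven ordinary values and one extreme value
fares = pd.Series([7.0, 8.0, 8.5, 9.0, 10.0, 7.5, 8.0, 9.5, 8.5, 9.0, 8.0, 300.0])

# z = (x - mean) / std, using the population standard deviation
z = (fares - fares.mean()) / fares.std(ddof=0)

# Keep rows whose absolute z-score is below the threshold
threshold = 3
kept = fares[z.abs() < threshold]

print(len(fares), len(kept))  # 12 11
```

On small samples a single value can never reach a very large z-score, because the outlier itself inflates the mean and the standard deviation; with very few rows the threshold of 3 may remove nothing.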



In [None]:
sns.histplot(df['age'])

In [None]:
z = np.abs(stats.zscore(df['fare']))
threshold = 3

df = df[z < threshold]

In [None]:
sns.histplot(df['fare'])

### Drop Duplicates

In [None]:
df.shape

In [None]:
# finding duplicates

df.duplicated().sum()


In [None]:
# removing duplicates
df.drop_duplicates(inplace=True)
df.shape
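
The same two calls on a tiny hypothetical frame (`toy` is invented for illustration) show what counts as a duplicate: a row is flagged only when every column matches an earlier row, unless `subset` narrows the comparison.

```python
import pandas as pd

# Hypothetical frame: row 2 repeats row 0 exactly
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cid"],
    "pclass": [1, 3, 1, 2],
})

# Rows identical to an earlier row in all columns
n_duplicates = int(toy.duplicated().sum())

# Drop them and rebuild a contiguous index
deduped = toy.drop_duplicates().reset_index(drop=True)

# Compare on selected columns only
n_same_name = int(toy.duplicated(subset=["name"]).sum())

print(n_duplicates)   # 1
print(deduped.shape)  # (3, 2)
```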

# Transforming the data
## Normalization

In [None]:
# pip install scikit-learn

from sklearn.preprocessing import MinMaxScaler  # rescales each feature to the range [0, 1]

Cols=['age', 'fare']
scaler= MinMaxScaler()

df[Cols]= scaler.fit_transform(df[Cols])
df
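
Min-max scaling applies x' = (x - min) / (max - min) to each column. A hand-computed sketch on made-up numbers (the `ages` series is an assumption for illustration) makes the mapping visible without scikit-learn.

```python
import pandas as pd

# Made-up ages
ages = pd.Series([10.0, 20.0, 30.0, 50.0])

# The smallest value maps to 0, the largest to 1
scaled = (ages - ages.min()) / (ages.max() - ages.min())

print(scaled.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

Because the minimum and maximum define the scale, a single extreme value compresses all other values toward 0, which is one reason outliers are handled before this step.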

In [None]:
sns.histplot(df['age'])

In [None]:
sns.histplot(df['fare'])

In [None]:

from sklearn.preprocessing import StandardScaler  # rescales each feature to mean 0 and standard deviation 1

# Work on a copy; plain assignment (df1 = df) would only create a second name for the same DataFrame
df1 = df.copy()

Cols = ['age', 'fare']
scaler = StandardScaler()

df1[Cols] = scaler.fit_transform(df1[Cols])

df1

# z = (x − μ) / σ

# Where:
# z is the standardized value.
# x is the original value.
# μ is the mean of the feature.
# σ is the standard deviation of the feature.

# Note: after scaling, each column has mean 0 and standard deviation 1.
# For roughly bell-shaped data most values fall between -3 and +3, but the range is not bounded.
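
The formula can be checked by hand on a few made-up numbers. This sketch uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; the `fares` values are invented so that the mean is 5 and the standard deviation is 2.

```python
import pandas as pd

# Made-up values with mean 5 and population standard deviation 2
fares = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = fares.mean()
sigma = fares.std(ddof=0)
standardized = (fares - mu) / sigma

print(mu, sigma)              # 5.0 2.0
print(standardized.tolist())  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```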



In [None]:
sns.histplot(df1['age'])

In [None]:
# Log Transformation

# The log is usually taken right after dealing with missing values and before removing outliers.
# np.log1p computes log(1 + x), so it also handles zeros, which the plain log maps to -inf.

df['age'] = np.log1p(df['age'])
df['fare'] = np.log1p(df['fare'])

df.head()

# log1p is the usual choice for data that contains 0 values.
# The plain log is defined only for positive values.
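
A short sketch of the difference between `np.log` and `np.log1p` on an array that contains a zero (the `fares` values are made up). `np.expm1` is the inverse of `np.log1p`, so the original scale can be recovered.

```python
import numpy as np

# Made-up fares including a zero
fares = np.array([0.0, 1.0, 9.0, 99.0])

# Plain log: the zero becomes -inf (the warning is silenced here on purpose)
with np.errstate(divide="ignore"):
    plain_log = np.log(fares)

# log1p computes log(1 + x), so zero maps to zero
log1p_fares = np.log1p(fares)

# expm1 computes exp(x) - 1 and undoes log1p
restored = np.expm1(log1p_fares)

print(plain_log[0])    # -inf
print(log1p_fares[0])  # 0.0
```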


In [None]:
sns.boxplot(data=df,x='sex', y='fare')

# The log transformation does not work well for age

In [None]:
sns.boxplot(data=df,x='sex', y='age')


# Organizing the data

In [None]:
df['family'] = df['sibsp'] + df['parch'] 

In [None]:
sns.histplot(df['family'])

In [None]:
sns.swarmplot(data=df, x='sex', y='age', hue='family')

In [None]:
# renaming

df= df.rename(columns={'survived':'survival'})

In [None]:
df.columns

In [None]:
# Pivot table

# Pass the aggregation by name; recent pandas versions warn when given np.sum
table = pd.pivot_table(df, values='fare', columns='sex', index='pclass', aggfunc='sum')

In [None]:
table
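
To make the layout of the pivot table concrete, here is the same call on a five-row hypothetical frame (`toy` is invented for illustration). Each cell holds the sum of `fare` for one `pclass`/`sex` combination, and combinations that never occur come out as NaN unless `fill_value` is set.

```python
import pandas as pd

# Hypothetical passengers
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3],
    "sex": ["male", "female", "male", "male", "female"],
    "fare": [50.0, 80.0, 20.0, 10.0, 8.0],
})

# Rows are classes, columns are sexes, cells are summed fares
table = pd.pivot_table(toy, values="fare", index="pclass", columns="sex", aggfunc="sum")

# Same table with empty combinations shown as 0
table_filled = pd.pivot_table(
    toy, values="fare", index="pclass", columns="sex", aggfunc="sum", fill_value=0
)

print(table.loc[2, "male"])           # 30.0
print(table_filled.loc[2, "female"])  # 0
```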

In [None]:
# checking relations

sns.scatterplot(data=df, x='fare', y='age')

In [None]:
# check the data after wrangling
df.head()

In [None]:
# saving the data

# index=False leaves out the row index, which is no longer contiguous after filtering
df.to_csv('titanic_cleaned.csv', index=False)
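
Whether the index is written affects what comes back on reload. The sketch below uses in-memory buffers instead of a file, and `toy` is a made-up frame with a gappy index like the one left behind by the outlier filters.

```python
import io
import pandas as pd

# Hypothetical cleaned frame with a non-contiguous index
toy = pd.DataFrame({"age": [22.0, 38.0], "fare": [7.25, 71.28]}, index=[3, 7])

# With the index written, reloading produces an extra unnamed column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# With index=False, only the data columns are written
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(with_index.columns))  # ['Unnamed: 0', 'age', 'fare']
print(list(reloaded.columns))    # ['age', 'fare']
```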