# Exploratory Data Analysis and Data Cleaning

- Data cleaning is the process of identifying and correcting errors and inconsistencies in data sets so that they can be used for analysis.
- Organizations clean their data to understand what is happening within their businesses, deliver trustworthy analytics that any user can rely on, and operate more efficiently.
- Create functions to automate the repetitive parts of EDA and data cleaning.
- Another benefit of using functions in EDA and data cleaning is to
    - eliminate inconsistent results caused by accidental differences in the code.
    
Data Quality
- accuracy
- completeness
- consistency
- integrity
- timeliness
- uniformity
- validity

The characteristics above are used to measure the cleanliness and overall quality of data sets.
    
Steps to follow:
- Modifying the column names
    - Make sure feature names follow certain format
        - Lowercase
        - Spaces represented by underscore
- Examine data to 
    - Inspect data types, 
    - Identify null values, and 
    - Inspect the summary statistics for each column available.
- Casting / data type conversion
- Separate Data Types
    - Make sure numerical and categorical/object type columns are separated for further analysis
- Dealing with Duplicates
- Dealing with Missing Values
- Dealing with Outliers
- Dealing with Noisy Data
    - Data binning
- For numerical fields, do some normalisation
- For object fields, do some string manipulation
- Data Transformation - Column corrections
- Data Transformation - Creating computed fields/columns
- Explore Relationship with Target variable (Machine learning)
    - Explore whether there is a relationship between our potential feature columns and our target
        - Calculate the correlations between potential features and our target 
        - Correlations won’t be calculated for non-numeric columns.
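
The last step can be sketched as follows. This is a minimal illustration on made-up data; the `target_correlations` helper and the column names are assumptions, not part of the original workflow.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
df = pd.DataFrame({
    "sqft": [800, 1200, 1500, 2000, 2600],
    "age": [40, 30, 22, 10, 3],
    "neighborhood": ["A", "B", "A", "C", "B"],  # non-numeric, so it is skipped
    "price": [100, 150, 190, 260, 330],
})

def target_correlations(df, target):
    """Correlation of every numeric feature with the target, strongest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # order by absolute strength so strong negative correlations are not buried
    return corr.reindex(corr.abs().sort_values(ascending=False).index)

corr_with_price = target_correlations(df, "price")
print(corr_with_price)
```

Selecting the numeric columns first makes the exclusion of non-numeric columns explicit instead of relying on the default behaviour of `corr()`.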

# Automating Data Cleaning

- The goal is to make sure that our processing steps are fairly generic and can adapt to various types of datasets.
- Things to look at
    - What format does my data come in? CSV, JSON, text? Or another format? 
        - How am I going to handle this format?
    - What data types do our data features come in? 
        - Does our dataset contain categorical and/or numerical data? 
        - How do we deal with each? 
        - Do we want to one-hot encode our data, and/or perform data type transformations?
    - Does our data contain missing values? 
        - If yes, how do we deal with them? 
        - Do we want to perform some imputation technique? 
        - Or can we safely delete the observations with missing values?
    - Does our data contain outliers? 
        - If yes, do we apply a regularization technique, or do we leave them as they are?
        - And what do we even consider an “outlier”?
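
As a small illustration of the data type questions above, the sketch below casts a numeric column that was read in as strings and one-hot encodes a categorical column. The data and column names are invented for the example.

```python
import pandas as pd

# Hypothetical listing data
df = pd.DataFrame({
    "room_type": ["private", "shared", "private", "entire"],
    "beds": ["1", "2", "1", "3"],          # numbers stored as strings
    "price": [50.0, 30.0, 55.0, 120.0],
})

# Data type transformation: cast the string column to a numeric type
df["beds"] = pd.to_numeric(df["beds"])

# One-hot encode the categorical column; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(df, columns=["room_type"], dtype=int)
print(encoded.columns.tolist())
```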

#  Rename column names

- Column names often contain capital letters or spaces, so we usually convert them to lowercase and replace the spaces with underscores.
- Make the column names as explicit as possible, so that your friends will roughly know what a column contains just by looking at its name.

In [None]:
def rename_col(df): 
    '''
    AIM    -> rename column names
    
    INPUT  -> df
    
    OUTPUT -> updated df with new column names 
    ------
    '''
    # each old name maps to its own new name
    df.rename(columns={'col_1': 'new_col_1',
                       'col_2': 'new_col_2'}, inplace=True)
    return df

In [None]:
train.columns = [i.replace(' ', '_').lower() for i in train.columns] 

In [None]:
def col_cleaner(df):
    df.columns = [x.replace("leadingword", '').replace('anotherword','') for x in df.columns]
    df.fillna('', inplace = True)
    df.reset_index(drop = True, inplace = True)
    
    return df

# Data Inspection/examination

Define a function that takes the data as input and returns a data frame in which each feature of the data set is a row and the summary statistics are columns.

The function calculates these summary statistics to reveal insights about the data.

In [None]:
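import pandas as pd

def ames_eda(df):
    eda_df = {}
    eda_df['null_sum'] = df.isnull().sum()
    eda_df['null_pct'] = df.isnull().mean()
    eda_df['dtypes'] = df.dtypes
    eda_df['count'] = df.count()
    # numeric_only=True skips object columns, which would otherwise raise a TypeError
    eda_df['mean'] = df.mean(numeric_only=True)
    eda_df['median'] = df.median(numeric_only=True)
    eda_df['min'] = df.min(numeric_only=True)
    eda_df['max'] = df.max(numeric_only=True)

    return pd.DataFrame(eda_df)

ames_eda(train)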

### Inspect all column types and evaluate whether there are any implications for EDA

In [None]:
df.dtypes.value_counts()

# Separating our features into numerical and categorical early on

To see all the object columns, the following returns an index of their names.

Inspect the data dictionary for a high-level explanation of what these columns represent.


In [None]:
df.select_dtypes(include=['object']).columns

In [None]:
cat_df = airbnb.select_dtypes(include=['object'])
num_df = airbnb.select_dtypes(exclude=['object'])

def printColumnTypes(non_numeric_df, numeric_df):
    '''prints the non-numeric and the numeric column names separately'''
    print("Non-Numeric columns:")
    for col in non_numeric_df:
        print(f"{col}")
    print("")
    print("Numeric columns:")
    for col in numeric_df:
        print(f"{col}")
        
printColumnTypes(cat_df, num_df)

### Exploring the Object Columns:

- Object columns may be categorical or ordinal features that:
    - can be converted to numeric values through data cleaning, and
    - are intuitively related to the price of a house — 
        - a house with central air would logically have a higher sale price than one without, holding all else constant.
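
A minimal sketch of such a conversion, assuming a binary column and an ordinal quality column. The column names and the quality ordering are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical housing data
houses = pd.DataFrame({
    "central_air": ["Y", "N", "Y", "Y"],
    "kitchen_qual": ["Ex", "TA", "Gd", "Fa"],
})

# Binary categorical feature: map to 1/0
houses["central_air_num"] = houses["central_air"].map({"Y": 1, "N": 0})

# Ordinal feature: map each level to its rank (the order is assumed here)
quality_order = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
houses["kitchen_qual_num"] = houses["kitchen_qual"].map(quality_order)

print(houses)
```

Values that are absent from the mapping dictionary become NaN, which makes unexpected categories easy to spot afterwards.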

# Data validation: 

### Duplicate values.

The nunique() function returns the number of distinct values in each column; a count lower than the number of rows means that the column contains repeated values.

In [None]:
import pandas as pd

# Create a DataFrame with missing values and duplicate values
data = {'col1': [1, 2, None, 4, 5],
        'col2': ['A', 'B', None, 'D', 'E'],
        'col3': ['X', 'Y', 'Z', 'X', 'Y']}
df = pd.DataFrame(data)

# Count distinct values per column (fewer than len(df) means repeated values)
unique_counts = df.nunique()
print("\nUnique value counts:")
print(unique_counts)

# Check for missing values
missing_values = df.isnull().any()
print("Missing Values:")
print(missing_values)

# Dealing with Duplicates

- Identical rows in the dataset are a duplicate data problem.
- They can appear because of a data combination mistake (the same row coming from multiple sources), because a user submitted an answer twice, etc.
    - The ideal way to handle the issue is simply to delete the duplicate rows.

- To check whether rows are duplicated, call duplicated().any() on your data frame; if it returns True, use the drop_duplicates function.
- You can also specify the columns in which to look for duplicate values.

##### Check for duplicates
- Check whether there are duplicate rows.

In [None]:
df.loc[df.duplicated()]

In [None]:
airbnb.duplicated().any()

# if True, drop the duplicate rows (drop_duplicates returns a new DataFrame)
airbnb = airbnb.drop_duplicates()

# if you want to drop rows that are duplicated in a specific column
airbnb = airbnb.drop_duplicates(subset=['col_name'])

### Removing duplicate rows: 

The drop_duplicates() function in pandas can be used to remove duplicate rows from a DataFrame.

In [None]:
import pandas as pd

# Create a DataFrame with duplicate rows
data = {'col1': [1, 2, 3, 3, 4, 5, 5],
        'col2': ['A', 'B', 'C', 'C', 'D', 'E', 'E']}
df = pd.DataFrame(data)

# Remove duplicate rows
df = df.drop_duplicates()

# Print the DataFrame without duplicate rows
print(df)

# Missing Values

- The location of empty cells can itself tell you something useful.
    - If NA values appear back to back only at the tail or in the middle of the dataset,
        - there may have been a technical problem during data collection.
        - Analyze the data collection process for that particular sequence of samples and try to find the origin of the issue.
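
One way to locate such back-to-back gaps is to list every run of consecutive missing values together with its starting position. The sketch below is an illustration; `nan_runs` is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, np.nan, np.nan, 6.0, np.nan, 8.0])

def nan_runs(series):
    """Return (start_position, length) for each run of consecutive NaNs."""
    is_na = series.isna().to_numpy()
    runs, start = [], None
    for pos, missing in enumerate(is_na):
        if missing and start is None:
            start = pos                        # a new run begins
        elif not missing and start is not None:
            runs.append((start, pos - start))  # the run just ended
            start = None
    if start is not None:                      # the run reaches the end of the data
        runs.append((start, len(is_na) - start))
    return runs

runs = nan_runs(s)
print(runs)  # [(2, 3), (6, 1)]
```

A long run at a specific position is a hint to check what happened during collection of exactly those samples.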
        
### Missing values

The isnull() function in pandas can be used to check for missing values in a DataFrame; it returns a Boolean mask that is True wherever a value is missing.

### Outputting columns in the dataframe that have null values (NaN) in them

- Use the .info() method and check whether the non-null counts of certain columns are less than the length of the dataframe, which you can get with the len() function.

- Your data may contain null values that show up differently, for example as “?” or “0”.
    - Scan your data, and if the null values are not NaN,
        - convert them with the .replace() method.

In [None]:
import pandas as pd
import numpy as np

df = pd.read_csv(r'C:\Users\Directory\filename.csv')

In [None]:
# Convert into NaN values and show columns with NaN values

df.replace("?", np.nan, inplace = True)

# Get Boolean output with all True values being the null values.

missing_data=df.isnull()

missing_data.head()

In [None]:
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")

### Check missing data 2.0

- If you want to check the number of missing values in each column, this is the fastest way.
- It gives you a better understanding of which columns have more missing data, which determines your next action.

In [None]:
def check_missing_data(df):
    # check for any missing data in the df (display in descending order)
    return df.isnull().sum().sort_values(ascending=False)

- Call isnull() and sum() to get a count of how many null values there are in each column.

In [None]:
df.isnull().sum()

### Checking missing values 3.0

In [None]:
def missing_cols(df):
    '''prints out columns with its amount of missing values'''
    total = 0
    for col in df.columns:
        missing_vals = df[col].isnull().sum()
        total += missing_vals
        if missing_vals != 0:
            print(f"{col} => {df[col].isnull().sum()}")
    
    if total == 0:
        print("no missing values left")
            
missing_cols(airbnb)

##### Total and percentage of missing data in each column

In [None]:
def intitial_eda_checks(df):
    '''
    Takes df
    Checks nulls
    '''
    if df.isnull().sum().sum() > 0:
        mask_total = df.isnull().sum().sort_values(ascending=False) 
        total = mask_total[mask_total > 0]

        mask_percent = df.isnull().mean().sort_values(ascending=False) 
        percent = mask_percent[mask_percent > 0] 

        missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    
        print(f'Total and Percentage of NaN:\n {missing_data}')
    else: 
        print('No NaN found.')

##### Getting a list of columns whose share of missing values exceeds a given threshold:

In [None]:
def view_columns_w_many_nans(df, missing_percent):
    '''
    Checks which columns have over specified percentage of missing values
    Takes df, missing percentage
    Returns columns as a list
    '''
    mask_percent = df.isnull().mean()
    series = mask_percent[mask_percent > missing_percent]
    columns = series.index.to_list()
    print(columns) 
    return columns

##### Get the percentage of missing values

In [None]:
def perc_missing(df):
    '''prints out columns with missing values with its %'''
    for col in df.columns:
        pct = df[col].isna().mean() * 100
        if (pct != 0):
            print('{} => {}%'.format(col, round(pct, 2)))
    
perc_missing(airbnb)

In [None]:
import numpy as np
print("Amount of missing values in - ")
for column in df.columns:
    percentage_missing = np.mean(df[column].isna())
    print(f'{column} : {round(percentage_missing*100)}%')

### Heat mapping missing values

##### Visualise Missing values

In [None]:
import missingno as msno
msno.matrix(df)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.heatmap(airbnb.isnull(), yticklabels=False, cmap='viridis', cbar=False)

In [None]:
sns.heatmap(df.isna().transpose())

### Output a dataframe without NaN values for a particular column
- Useful when you want to output a dataframe with all the available data that a column has.
    - For example, given a dataframe with all customers’ information, you may want an updated dataframe that keeps every available customer ID and
    - removes the rows with a missing customer ID.
- A similar concept can be applied to time series data where the column is a timestamp.

In [None]:
def remove_nan_values(df):
    '''
    AIM    -> remove NaN values of a particular column and output the whole dataframe
     
    INPUT  -> df
    
    OUTPUT -> updated df without NaN values for a particular column 
    ------
    '''
    df = df[df['col_1'].notnull()]
    return df

# Dealing with missing values

- When dropping rows/columns, you’re essentially losing information that might be useful for prediction.
    - If more than 70–80% of a column is NA, you can drop the column.

- Imputing values will introduce bias into your data, but it still might be better than removing your features.
    - If the NA values are in a column that represents an optional question in a form, that column can be encoded as answered (1) or not answered (0).
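
The answered / not answered encoding can be sketched in one line. The survey data and column names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical survey with an optional free-text question
survey = pd.DataFrame({"optional_comment": ["great", np.nan, "ok", np.nan]})

# 1 if the respondent answered the optional question, 0 otherwise
survey["answered_comment"] = survey["optional_comment"].notna().astype(int)
print(survey)
```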

### Technique to deal with missing values

1. Drop the feature
2. Drop the row

### Dropping 

##### Remove any columns that have 40% or more of their data as null values

In [None]:
NA_val = df_cleaned.isna().sum()

def na_filter(na, threshold=.4):
    # only select variables whose share of missing values is below the threshold
    col_pass = []
    for i in na.keys():
        if na[i] / df_cleaned.shape[0] < threshold:
            col_pass.append(i)
    return col_pass

df_cleaned = df_cleaned[na_filter(NA_val)]
df_cleaned.columns

##### Drop the columns with too many missing values (over a certain threshold you specify)

In [None]:
def drop_columns_w_many_nans(df, missing_percent):
    '''
    Takes df, missing percentage
    Drops the columns whose missing value is bigger than missing percentage
    Returns df
    '''
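    # view_columns_w_many_nans already returns a list of column names
    list_of_cols = view_columns_w_many_nans(df, missing_percent=missing_percent)
    # drop() returns a new DataFrame, so the result has to be assigned
    df = df.drop(columns=list_of_cols)
    print(list_of_cols)
    return df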

#####  Drop multiple columns

In [None]:
def drop_multiple_col(col_names_list, df): 
    '''
    AIM    -> Drop multiple columns based on their column names 
    
    INPUT  -> List of column names, df
    
    OUTPUT -> updated df with dropped columns 
    ------
    '''
    df.drop(col_names_list, axis=1, inplace=True)
    return df

##### Drop the feature

In [None]:
# Drop unnecessary columns that are not important
colsToDrop = ['id','host_name','last_review']

airbnb.drop(colsToDrop, axis=1, inplace=True)

missing_cols(airbnb)

##### Drop the whole row (axis = 0) or the whole column (axis = 1)

In [None]:
#Drop whole row with NaN
df.dropna(subset=["price"], axis=0, inplace=True)

df.reset_index(drop=True,inplace=True)

#Replacing numerical null values with the mean

avg_loss = df["losses"].astype("float").mean(axis=0)

# assign the result back; an in-place replace on a selected column may not update df
df["losses"] = df["losses"].replace(np.nan, avg_loss)

##### Drop the row

In [None]:
# remove rows with missing values in price
# (dropna on the selected column alone would leave airbnb unchanged)
airbnb = airbnb.dropna(subset=['price'])

### Ways to Handle Missing Values

- Drop missing values
- Ignore tuples with missing values
- Imputation etc

In [None]:
import pandas as pd
import numpy as np

# Create sample DataFrame
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, 7, 8, 9, np.nan],
    'C': [10, np.nan, 12, 13, 14]
}

df = pd.DataFrame(data)

# Drop missing values
df_dropped = df.dropna()  # Drop rows with any missing values
df_dropped_column = df.dropna(axis=1)  # Drop columns with any missing values

# Ignore tuples with missing values
df_ignore = df.dropna(how='any')  # Drop rows with any missing values

In [None]:
# Imputation
df_imputed_mean = df.fillna(df.mean())  # Fill missing values with column mean
df_imputed_median = df.fillna(df.median())  # Fill missing values with column median
df_imputed_custom = df.fillna({'A': 0, 'B': 1, 'C': 2})  # Fill missing values with custom values

print(df_dropped)
print(df_dropped_column)
print(df_ignore)
print(df_imputed_mean)
print(df_imputed_median)
print(df_imputed_custom)

### Imputing/ Replacing

3. Impute the missing value
    - mean, median, mode;
    - kNN;
        - it is common practice to predict missing values in our data with the help of various regression or classification models;
    - zero or a constant, etc.
4. Replace it

The choice of how we handle missing values will depend mostly on:

- the data type (numerical or categorical) and
- how many missing values we have relative to the number of total samples we have 
    - (deleting 1 observation out of 100k will have a different impact than deleting 1 out of 100)

##### Normally distributed data

- Get all the values that are within 2 standard deviations from the mean. 
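- Next, fill in the missing values by generating random numbers between (mean - 2 * std) and (mean + 2 * std).

In [None]:
average_age, std_age = dataframe["age"].mean(), dataframe["age"].std()
count_nan_age = dataframe["age"].isna().sum()
# randint needs integer bounds; .loc avoids chained assignment
rand = np.random.randint(int(average_age - 2*std_age), int(average_age + 2*std_age), size = count_nan_age)
dataframe.loc[dataframe["age"].isna(), "age"] = rand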

##### Numerical data
- Replace NaN with the mean.

##### Categorical data
- Replace NaN with the most frequent value (the mode).

- Replace with another function 
    - here we can use a function with np.vectorize() or the .apply() method

In [None]:
#Replacing categorical data with the most frequent value

df['doors'].value_counts().idxmax()
#output is 'four'

df['doors'] = df['doors'].replace(np.nan, 'four')

###### Handling missing data: 

- The fillna() function in pandas can be used to replace missing values with a specific value or using a forward or backward fill. 
- The dropna() function can be used to remove rows or columns that contain missing values.

In [None]:
import pandas as pd

# Create a DataFrame with missing values
data = {'col1': [1, 2, None, 4, 5],
        'col2': ['A', 'B', None, 'D', 'E']}
df = pd.DataFrame(data)

# Replace missing values with a specific value
df_filled = df.fillna('Unknown')

# Remove rows with missing values
df_dropped = df.dropna()

# Print the DataFrames
print("DataFrame with filled missing values:")
print(df_filled)
print("\nDataFrame with dropped missing values:")
print(df_dropped)

##### Using a python loop in pandas to fill in missing values

In [None]:
def fill_missing_value(df, fillvalue=0):
    for j, col in enumerate(df.columns):
        for i, value in enumerate(df[col].values):
            if value == 'None' or value == ' ':
                # .iat sets by position and avoids chained assignment
                df.iat[i, j] = fillvalue
    return df

In [None]:
def painful_fillna(df, fillvalue = 0):
    df2 = df.copy()
    for j, col in enumerate(df2.columns):
        for i, value in enumerate(df2[col].values):
            if pd.isna(value):  # unlike np.isnan, this also accepts non-numeric values
                df2.iat[i, j] = fillvalue
    return df2

##### Replacing blanks with 0 and strings with their token counts (R)

In [None]:
# R code
ff <- function(x) {
    res <- vector()   # create an empty vector to store the count for each element
    for (i in 1:length(x)) {  # iterate through each element
        # if the element is a single space return 0, else split the string on tabs and count the pieces
        res[i] <- ifelse(x[i] == " ", 0, length(unlist(strsplit(x[i], "\t"))))
    }
    return(res)  # return the stored values
}

df <- data.frame(sapply(dt, function(x) ff(x)))

##### Filling in or replacing values columnwise

In [None]:
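# one column: cast to string, strip whitespace, turn empty strings into NaN
df['col'] = df['col'].astype(str).str.strip().replace('', np.nan)

In [None]:
# every object column of the dataframe
df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x).replace('', np.nan)

In [None]:
# one column that already holds strings
df['col'] = df['col'].str.strip().replace('', np.nan)

In [None]:
# element-wise over the whole dataframe (applymap is named DataFrame.map in pandas >= 2.1)
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x).replace('', np.nan)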

##### Imputing
For imputing, there are 3 main techniques shown below.

- fillna: fills in null values with a given value (mean, median, mode, or a specified value)
- bfill / ffill: backward fill and forward fill (filling in a missing value with the next or the previous valid value in the column)
- SimpleImputer: scikit-learn’s built-in class that imputes missing values (commonly used alongside a pipeline when building ML models)

In [None]:
# The options below are alternatives; pick one of them.
# Results are assigned back, because an in-place fill on a selected column may not update airbnb.

# imputing price with mean
price_mean_value = round(airbnb['price'].mean(), 2)
airbnb['price'] = airbnb['price'].fillna(price_mean_value)

# imputing price with median
price_median_value = round(airbnb['price'].median(), 2)
airbnb['price'] = airbnb['price'].fillna(price_median_value)

# imputing with bfill or ffill
airbnb['price'] = airbnb['price'].bfill()
airbnb['price'] = airbnb['price'].ffill()

# imputing with SimpleImputer from the sklearn library
import numpy as np
from sklearn.impute import SimpleImputer
# define the imputer
imr = SimpleImputer(missing_values=np.nan, strategy='mean') # or median

airbnb[['price']] = imr.fit_transform(airbnb[['price']])

# use strategy = 'most_frequent' for categorical data

##### Replace
To replace values, the fillna function is also used.

- Pass a dictionary in which the key is the column name and the value is the substitute for its missing values: {column_name: replacement_for_NA}

In [None]:
# replace null values in reviews_per_month with 0 
airbnb.fillna({'reviews_per_month':0}, inplace=True)

missing_cols(airbnb)

In [None]:
# replace null values in name with 'None'
airbnb.fillna({'name':'None'}, inplace=True)

missing_cols(airbnb)

# 6 Different Ways to Compensate for Missing Values in a Cross-Sectional Dataset (time-series datasets are a different story)
Popular strategies to statistically impute missing values in a dataset

- Missing values are often encoded as NaNs, blanks or any other placeholders.
- Algorithms such as scikit-learn estimators assume that all values are numerical and hold meaningful values.

### Handling this problem
- The simplest option is to get rid of the observations that have missing data,
    - at the risk of losing data points with valuable information.
- A better strategy is to impute the missing values,
    - that is, to infer them from the existing part of the data.

### Three main types of missing data: 
- Missing completely at random (MCAR)
- Missing at random (MAR)
- Not missing at random (NMAR)

##### 1) Do Nothing:
- Let the algorithm handle the missing data.
- Some algorithms can factor in the missing values and learn the best imputation values for the missing data based on the training loss reduction
    - (e.g. XGBoost).
- Some other algorithms have the option to just ignore them
    - (e.g. LightGBM with use_missing=false).
- Other algorithms will throw an error complaining about the missing values
    - (e.g. scikit-learn’s LinearRegression).
        - For such cases, handle the missing data and clean it before feeding it to the algorithm.
        
##### 2) Imputation Using (Mean/Median) Values:
- This works by calculating the mean/median of the non-missing values in a column and
- then replacing the missing values within each column separately and independently from the others.

- Pros:
    - Easy and fast.
    - Works well with small numerical datasets.
- Cons:
    - Doesn’t factor the correlations between features. It only works on the column level.
    - Will give poor results on encoded categorical features (do NOT use it on categorical features).
    - Not very accurate.
    - Doesn’t account for the uncertainty in the imputations.

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error
from math import sqrt
import random
import numpy as np
random.seed(0)

In [2]:
#Fetching the dataset
import pandas as pd
dataset = fetch_california_housing()
train, target = pd.DataFrame(dataset.data), pd.DataFrame(dataset.target)
train.columns = ['0','1','2','3','4','5','6','7']
train.insert(loc=len(train.columns), column='target', value=target)

In [4]:
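#Randomly replace 40% of the first column with NaN values
column = train['0']
print(column.size)
missing_pct = int(column.size * 0.4)
# sample without replacement so that exactly 40% of the rows are selected
i = random.sample(range(column.shape[0]), missing_pct)
# np.nan instead of np.NaN (removed in NumPy 2.0); .loc avoids chained assignment
train.loc[i, '0'] = np.nan
print(column.shape[0])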

20640
20640


In [5]:
#Impute the values using the scikit-learn SimpleImputer class
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer( strategy='mean') #for median imputation replace 'mean' with 'median'
imp_mean.fit(train)
imputed_train_df = imp_mean.transform(train)

##### 3)  Imputation Using (Most Frequent) or (Zero/Constant) Values:
- **Most Frequent** is another statistical strategy to impute missing values
    - It works with categorical features (strings or numerical representations)
    - by replacing missing data with the most frequent values within each column.
- **Zero or Constant imputation** — it replaces the missing values with either zero or any constant value you specify

- Pros:
    - Works well with categorical features.
- Cons:
    - It also doesn’t factor the correlations between features.
    - It can introduce bias in the data.

In [None]:
#Impute the values using the scikit-learn SimpleImputer class

from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer( strategy='most_frequent')
imp_mean.fit(train)
imputed_train_df = imp_mean.transform(train)

##### 4) Imputation Using k-NN:
- k-nearest neighbours (k-NN) is a simple algorithm used for classification and regression.
- The algorithm uses ‘feature similarity’ to predict the values of any new data points.
    - This means that the new point is assigned a value based on how closely it resembles the points in the training set.
    - This can be very useful for predicting missing values: find the k closest neighbours of the observation with missing data, then impute the values based on the non-missing values in the neighbourhood.
        - The example below uses the Impyute library, which provides a simple and easy way to use k-NN for imputation.
- Impyute first creates a basic mean imputation, then uses the resulting complete array to construct a KDTree.
- Then, it uses the resulting KDTree to compute nearest neighbours (NN).
- After it finds the k-NNs, it takes the weighted average of them.

- Pros:
    - Can be much more accurate than the mean, median or most frequent imputation methods (It depends on the dataset).
- Cons:
    - Computationally expensive. KNN works by storing the whole training dataset in memory.
    - K-NN is quite sensitive to outliers in the data (unlike SVM)

In [None]:
import sys
from impyute.imputation.cs import fast_knn
sys.setrecursionlimit(100000) #Increase the recursion limit of the Python interpreter

# start the KNN training
imputed_training=fast_knn(train.values, k=30)
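
If Impyute is not available, a similar result can be sketched with scikit-learn's `KNNImputer`. This is a different implementation from `fast_knn` (no KDTree, and uniform rather than distance-weighted averaging in this configuration), shown here on a tiny invented array.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy array: the second value of the second row is missing
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 16.0],
])

# Each missing value becomes the plain average of that feature
# over the 2 nearest rows that have the feature observed
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

Here the two nearest rows to `[2.0, nan]` are `[1.0, 2.0]` and `[3.0, 6.0]`, so the gap is filled with their average, 4.0.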

##### 5) Imputation Using Multivariate Imputation by Chained Equation (MICE)

- This type of imputation works by filling in the missing data multiple times.
- Multiple imputations (MIs) are much better than a single imputation because they measure the uncertainty of the missing values in a better way.
- The chained equations approach is also very flexible and can handle variables of different data types
    - (e.g. continuous or binary) as well as
    - complexities such as bounds or survey skip patterns.

In [None]:
from impyute.imputation.cs import mice

# start the MICE training
imputed_training=mice(train.values)
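
A related sketch with scikit-learn's `IterativeImputer`, which is inspired by MICE but returns a single imputation by default; it is used here as a stand-in, not as the Impyute implementation. The synthetic data is an assumption made for the example.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to enable the import below)
from sklearn.impute import IterativeImputer

# Synthetic data: the second column is roughly twice the first
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

X_missing = X.copy()
X_missing[:20, 1] = np.nan  # hide 20 values of the second column

# Each feature with missing values is regressed on the other features, round-robin
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_missing)

print(np.abs(X_imputed[:20, 1] - X[:20, 1]).mean())
```

To approximate multiple imputation, the imputer can be run several times with `sample_posterior=True` and different `random_state` values, and the resulting datasets analysed together.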

##### 6) Imputation Using Deep Learning (Datawig):
- This method works very well with categorical and non-numerical features.
- Datawig is a library that trains deep neural networks to impute missing values in a dataframe.
- It also supports both CPU and GPU for training.
- Pros:
    - Quite accurate compared to other methods.
    - It has some functions that can handle categorical data (Feature Encoder).
    - It supports CPUs and GPUs.
- Cons:
    - Single Column imputation.
    - Can be quite slow with large datasets.
    - You have to specify the columns that contain information about the target column that will be imputed.

In [None]:
import datawig

df_train, df_test = datawig.utils.random_split(train)

#Initialize a SimpleImputer model
imputer = datawig.SimpleImputer(
    input_columns=['1','2','3','4','5','6','7', 'target'], # column(s) containing information about the column we want to impute
    output_column= '0', # the column we'd like to impute values for
    output_path = 'imputer_model' # stores model data and metrics
    )

#Fit an imputer model on the train data
imputer.fit(train_df=df_train, num_epochs=50)

#Impute missing values and return original dataframe with predictions
imputed = imputer.predict(df_test)

##### Other Imputation Methods:
**Stochastic regression imputation:**
- It is quite similar to regression imputation: it predicts the missing values by regressing them on other related variables in the same dataset, and then adds a random residual value.

**Extrapolation and Interpolation:**
- They estimate values from other observations: interpolation within the range of a discrete set of known data points, extrapolation beyond it.

**Hot-Deck imputation:**
- Works by randomly choosing the missing value from a set of similar observations (donors).
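
Of these, interpolation is the quickest to try in pandas. A minimal sketch on an invented, evenly spaced series:

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with a two-value gap
readings = pd.Series([10.0, np.nan, np.nan, 16.0, 18.0])

# Linear interpolation places the missing points on the straight line
# between the neighbouring known values
filled = readings.interpolate(method="linear")
print(filled.tolist())
```

Linear interpolation treats the rows as equally spaced, so it mostly makes sense for ordered data such as measurements taken at regular intervals.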


### The pipeline will follow this strategy

- Imputation is preferred over deletion, and
- the pipeline will support the following techniques:
    - prediction with Linear and Logistic Regression, 
    - imputation with 
        - K-NN, 
        - mean, 
        - median and 
        - mode, as well as 
        - deletion.
        
##### Function
- Create a separate class for handling missing values.
- Its handle function, shown below, handles numerical and categorical missing values differently:
    - some imputation techniques are applicable only to numerical data,
    - whereas some apply only to categorical data.
    - Let’s look at the first part of it, which handles numerical features.
    
- **How it works**
- It checks which handling method has been chosen for numerical and categorical features.
- The default setting is set to ‘auto’ which means that:
    - numerical missing values 
        - will first be imputed through prediction with Linear Regression, and 
        - the remaining values will be imputed with K-NN
    - categorical missing values 
        - will first be imputed through prediction with Logistic Regression, and 
        - the remaining values will be imputed with K-NN
- Categorical features:
    - support only imputation with
        - Logistic Regression, 
        - K-NN and 
            - When using K-NN, 
                - we will first label encode our categorical features to integers, 
                - use these labels to predict our missing values, 
                - and finally map the labels back to their original values.
        - mode imputation. 
    - Depending on the handling method chosen, the handle function calls the required functions from within its class, which then manipulate the data with the help of various scikit-learn classes:
        - the _impute function is in charge of K-NN, mean, median and mode imputation,
        - _lin_regression_impute and _log_regression_impute perform imputation through prediction,
        - the role of _delete is self-explanatory.
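
The K-NN route for categorical features (label encode, impute, map back) can be sketched as follows. This is an illustration under assumptions, not the class's actual `_impute` code: the data is invented, `pd.factorize` stands in for a label encoder, and scikit-learn's `KNNImputer` is used for the neighbour search.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: "kind" is categorical and has two gaps
df = pd.DataFrame({
    "size": [1.0, 1.1, 0.9, 5.0, 5.2, 5.1],
    "kind": ["small", "small", np.nan, "large", "large", np.nan],
})

# 1) label encode: factorize marks missing values with the code -1
codes, labels = pd.factorize(df["kind"])
encoded = np.where(codes == -1, np.nan, codes.astype(float))

# 2) impute the codes from the nearest rows
features = pd.DataFrame({"size": df["size"], "kind_code": encoded})
imputed = KNNImputer(n_neighbors=2).fit_transform(features)

# 3) round to the nearest valid code and map the codes back to the labels
kind_codes = np.rint(imputed[:, 1]).astype(int)
df["kind"] = np.asarray(labels)[kind_codes]
print(df)
```

Averaging label codes is only a rough device: with more than two categories the rounded average can land on a category that none of the neighbours has, so `n_neighbors=1` or a classifier may be the safer choice.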

In [None]:
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.linear_model import LinearRegression

class MissingValues:
    # Function for handling missing values in the data
    def handle(self, df, missing_num='auto', missing_categ='auto', _n_neighbors=3):
        count_missing = df.isna().sum().sum()
        if count_missing != 0:
            # drop rows containing only missing values
            df = df.dropna(how='all')
            df = df.reset_index(drop=True)
            
            if missing_num:
                # automated handling of numerical missing values
                if missing_num == 'auto':
                    # first predict with linear regression ...
                    lr = LinearRegression()
                    df = self._lin_regression_impute(df, lr)
                    # ... then fill the remaining values with K-NN
                    imputer = KNNImputer(n_neighbors=_n_neighbors)
                    df = self._impute(df, imputer, type='num')
                # linear regression imputation
                elif missing_num == 'linreg':
                    lr = LinearRegression()
                    df = self._lin_regression_impute(df, lr)
                # knn imputation
                elif missing_num == 'knn':
                    imputer = KNNImputer(n_neighbors=_n_neighbors)
                    df = self._impute(df, imputer, type='num')
                # mean, median or mode imputation
                elif missing_num in ['mean', 'median', 'most_frequent']:
                    imputer = SimpleImputer(strategy=missing_num)
                    df = self._impute(df, imputer, type='num')
                # delete missing values
                elif missing_num == 'delete':
                    df = self._delete(df, type='num')
                   
            if missing_categ:
                ...
        else:
            pass
        return df

In [None]:
class MissingValues:
    def handle(df, missing_num='auto', missing_categ='auto', _n_neighbors=3):
        ...
    def _impute(df, imputer, type):
        ...
    def _lin_regression_impute(df, model):
        ...
    def _log_regression_impute(df, model):
        ...
    def _delete(df, type):
        ...

# Dealing with Outliers

- Outliers are extremely large or small values relative to the other points in the dataset. 
    - Their existence can dramatically affect the performance of mathematical models.
- Outliers can be dangerous because they can skew your model and give you predictions that are biased and erroneous.
- Use the describe function and compare the minimum and maximum with the mean and the quartiles; an extreme far away from the bulk of the data hints at outliers.
- Any data value that lies more than 1.5 * IQR below Q1 or more than 1.5 * IQR above Q3 is considered an outlier.
    - IQR (interquartile range) is the difference between Q3 and Q1 (IQR = Q3 - Q1).
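
As a worked example of the 1.5 * IQR rule, the sketch below computes the quartiles and both fences for a small made-up array (the values and variable names are assumptions chosen so the arithmetic is easy to follow).

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])  # made-up sample

q1, q3 = np.percentile(values, [25, 75])  # 3.0 and 7.0 for this sample
iqr = q3 - q1                             # 4.0
lower_fence = q1 - 1.5 * iqr              # -3.0
upper_fence = q3 + 1.5 * iqr              # 13.0

# everything outside the two fences is flagged
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, outliers)
```

Only the value 100 falls outside the range from -3 to 13, so it is the single flagged point.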

##### Numerical Data

In [None]:
airbnb['price'].describe()

##### Or we can Manually filter out what we define as being Outliers

In [None]:
df_cleaned = df_cleaned[df_cleaned['price'].between(999.99, 99999.00)]
df_cleaned = df_cleaned[df_cleaned['year'] > 1990]
df_cleaned = df_cleaned[df_cleaned['odometer'] < 899999.00]
df_cleaned.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

##### Formula for counting outliers with the IQR

In [None]:
def number_of_outliers(df):
    
    df = df.select_dtypes(exclude = 'object')
    
    Q1 = df.quantile(0.25)
    Q3 = df.quantile(0.75)
    IQR = Q3 - Q1
    
    return ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).sum()

##### Outlier detection: 

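The z-score of each value in a dataset can be used to identify outliers. SciPy offers scipy.stats.zscore() for this; the cell below computes the same quantity directly with NumPy. Note that with only five values and the population standard deviation, no z-score can exceed 2 (the maximum is sqrt(n - 1)), so the threshold of 2.5 flags nothing here and the printed result is empty.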

In [1]:
import numpy as np

# Create a NumPy array with values
data = np.array([10, 15, 20, 25, 100])

# Calculate the z-score for each value
z_scores = np.abs((data - np.mean(data)) / np.std(data))

# Identify outliers based on a threshold
threshold = 2.5
outliers = data[z_scores > threshold]

# Print the outliers
print(outliers)

[]


##### Treating outliers

- One way to treat outliers is to cap them at the boundaries of the accepted range, Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. 
- Using the pandas and numpy libraries, the function below does this task. 
- Here, the lower_upper_range function finds the range outside of which values count as outliers. 
- Then the numpy clip function clips the values to that range.

In [None]:
def lower_upper_range(datacolumn):
    # np.percentile does not need sorted input
    Q1, Q3 = np.percentile(datacolumn, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

# assumption: every numeric column of df should be clipped
columns = df.select_dtypes(include=np.number).columns
for col in columns:
    lowerbound, upperbound = lower_upper_range(df[col])
    df[col] = np.clip(df[col], a_min=lowerbound, a_max=upperbound)

### Visualising Outliers

##### Numerical Data

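- You can plot a box plot to see the 
    - Median
    - Q1 and Q3 (the edges of the box, so its height is the IQR)
    - Whiskers (the smallest and largest values within 1.5 * IQR of the box)
    - Points beyond the whiskers, which are drawn individually as outliers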

In [None]:
plt.figure(figsize=(10, 6))
airbnb.boxplot(column=['price'])

##### Categorical Data 
 
- You can plot a bar chart to view the count of each category and spot categories that occur unusually rarely.
- Outliers in categorical data are tricky, because you have to determine from context whether it is appropriate to call a value an outlier.

In [None]:
plt.figure(figsize=(10, 6))
airbnb['neighbourhood_group'].value_counts().plot.bar()

### Pipeline will follow the strategy
Ask ourselves: 
- When do we consider a value to be an outlier? 
- For our pipeline, we will use a commonly applied rule that says that 
    - a data point can be considered an outlier if it is outside the following range:
        - [Q1 - 1.5 * IQR ; Q3 + 1.5 * IQR]
    - where Q1 and Q3 are the 1st and the 3rd quartiles and IQR is the interquartile range.
    
- There are various strategies to handle Outliers, and we will focus on the following two: 
    - winsorization and
        - When using winsorization, we will again use our above defined range to replace outliers:
            - values > upper bound will be replaced by the upper range value and
            - values < lower bound will be replaced by the lower range value.
    - deletion.
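
The two strategies can be sketched with plain pandas as shown below. The function names `compute_bounds`, `winsorize` and `delete_outliers` are hypothetical stand-ins for the elided `_compute_bounds`, `_winsorization` and `_delete` methods, whose real bodies are not shown in these notes; the toy column is made up.

```python
import pandas as pd

def compute_bounds(df, feature):
    """Return the (lower, upper) IQR bounds for one numeric column."""
    q1, q3 = df[feature].quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def winsorize(df, features):
    """Replace values outside the bounds with the nearest bound."""
    out = df.copy()
    for feature in features:
        lower, upper = compute_bounds(out, feature)
        out[feature] = out[feature].clip(lower=lower, upper=upper)
    return out

def delete_outliers(df, features):
    """Drop every row with an out-of-range value in any listed column."""
    out = df.copy()
    for feature in features:
        lower, upper = compute_bounds(out, feature)
        # note: between() is False for NaN, so rows with missing values are dropped too
        out = out[out[feature].between(lower, upper)]
    return out.reset_index(drop=True)

toy = pd.DataFrame({"price": [1, 2, 3, 4, 5, 6, 7, 8, 100]})
winsorized = winsorize(toy, ["price"])
trimmed = delete_outliers(toy, ["price"])
print(winsorized["price"].max(), len(trimmed))
```

Winsorization keeps all nine rows and caps the extreme price at the upper bound, while deletion removes the row entirely; which one fits depends on whether the extreme value is an error or a genuine observation.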

In [None]:
class Outliers:
    # Function that handles outliers in the data
    def handle(df, outliers='winz'):
        if outliers:
            if outliers == 'winz':  
                df = Outliers._winsorization(df)
            elif outliers == 'delete':
                df = Outliers._delete(df)
        return df     
    def _winsorization(df):
        ...
    def _delete(df):
        ...
    def _compute_bounds(df, feature):
        ...

### Calculating Outliers and their types

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.cluster import DBSCAN

# Create sample data
data = {
    'Feature1': [1, 2, 3, 4, 1000],
    'Feature2': [5, 6, 7, 8, 2000],
    'Context': ['A', 'A', 'A', 'A', 'B']
}
df = pd.DataFrame(data)

In [None]:
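# Global Outliers
# z-scores only make sense for numeric columns, so 'Context' is skipped.
# With only five rows no z-score can reach 3, so this list stays empty for
# the sample data; use more data or a lower threshold to see flagged rows.
global_outliers = []
for feature in df.select_dtypes(include=np.number).columns:
    z_scores = (df[feature] - df[feature].mean()) / df[feature].std()
    global_outliers.extend(df[abs(z_scores) > 3].index)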

In [None]:
# Contextual Outliers
contextual_outliers = []
for context in df['Context'].unique():
    context_data = df[df['Context'] == context]
    for feature in df.columns[:-1]:
        z_scores = (context_data[feature] - context_data[feature].mean()) / context_data[feature].std()
        contextual_outliers.extend(context_data[abs(z_scores) > 3].index)

In [None]:
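# Collective Outliers using DBSCAN
X = df.drop('Context', axis=1)
# eps must be at least the spacing of neighbouring points (about 1.41 here);
# with eps=0.3 every point would be labelled as noise
dbscan = DBSCAN(eps=2.0, min_samples=2)
dbscan.fit(X)
collective_outliers = np.where(dbscan.labels_ == -1)[0]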

# Print the outliers
print("Global Outliers:", global_outliers)
print("Contextual Outliers:", contextual_outliers)
print("Collective Outliers:", collective_outliers)

# Noisy Data

- Noise consists of unwanted or meaningless data items: 
    - features or records which don't help in explaining the feature itself, or the relationship between feature and target. 
- Noisy data in a data set can significantly hamper the prediction of any meaningful information and cause algorithms to miss patterns in the data. 
- Noise in a data set can dramatically decrease classification accuracy and lead to poor prediction results. It can take the form of anomalies in features and target, irrelevant or weak features, and noisy records.

### Data Binning

- I use binning the most when pre-processing my data, because an important aspect of data preparation is ensuring the data is meaningful and easy to understand.

##### Numerical data 

- Data binning allows you to split continuous numerical data into bins for ease of grouping and visualization.
- One way to create an evenly spaced sequence of bin edges for continuous numerical data is np.linspace().
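
To make the role of np.linspace() concrete, the short sketch below builds the edges for three equal-width bins over an assumed value range of 0 to 90 (the range is made up for illustration).

```python
import numpy as np

# Four evenly spaced edges between 0 and 90 delimit three equal-width bins
edges = np.linspace(0, 90, 4)
bin_width = edges[1] - edges[0]
print(edges, bin_width)
```

The number of edges is always one more than the number of bins, which is why three labels go together with four edges.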

##### Categorical data

- Categorical data can also be binned in a way to add meaning to data which has arbitrary values that mean something else based on a business requirement specification or index. I would refer to this as data mapping or flagging.
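
A minimal sketch of such data mapping or flagging is shown here. The status codes, their meanings and the column names are invented for illustration; in practice the dictionary would come from the business specification.

```python
import pandas as pd

# Hypothetical codes and their business meaning (assumed for this example)
status_labels = {1: "active", 2: "suspended", 9: "closed"}

df = pd.DataFrame({"status_code": [1, 9, 2, 1, 7]})
# map() looks each code up in the dictionary; unknown codes become NaN
df["status"] = df["status_code"].map(status_labels)
# a flag column marks the rows whose code has no documented meaning
df["unknown_status"] = df["status"].isna()
print(df)
```

The flag column makes undocumented codes easy to count and review instead of silently losing them.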

In [None]:
bins = np.linspace(min(df["losses"]), max(df["losses"]), 4)

group_names = ['Low', 'Medium', 'High']

df['losses-binned'] = pd.cut(df['losses'], bins, labels=group_names, include_lowest=True )

##### This new binned column can be visualized using matplotlib

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# count the rows per bin, in the same order as group_names
bin_counts = df["losses-binned"].value_counts().reindex(group_names)
plt.bar(group_names, bin_counts)

# set x/y labels and plot title
plt.xlabel("losses")
plt.ylabel("count")
plt.title("losses binned")

### Get discrete intervals from numerical values

- Use this to convert the numerical values in a column to discrete intervals based on a specified range.
- Use pd.cut if you want to convert a continuous variable to a categorical variable.
    - Say we have a rating column that consists of numerical values from 1 to 10. 
    - What if we want to convert these rating values to groups based on their values and a specified range? 
        - We can bin the values into discrete intervals and label them as bad, moderate, good, strong using pd.cut.
    - Similarly, you can convert ages to age-range groups, where each age is assigned a label.

In [None]:
def get_discrete_intervals_from_values(df):
    '''
    AIM    -> get discrete intervals by binning values to a range
     
    INPUT  -> df
    
    OUTPUT -> updated df with discrete intervals based on numerical values 
    ------
    '''
    df['rating'] = pd.cut(df['rating'], 
                          bins=[-1,3,5,7,10], 
                          labels=['bad','moderate','good','strong'])
    return df

### Using Binning to handle Noisy Data

In [None]:
import pandas as pd
import numpy as np

# Create sample DataFrame
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Income': [50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000, 140000]
}
df = pd.DataFrame(data)

# Binning
df['AgeBin'] = pd.cut(df['Age'], bins=[0, 30, 40, 50, 100], labels=['Young', 'Adult', 'Middle-aged', 'Senior'])

print(df)

### Using Regression to handle Noisy Data

In [None]:
from sklearn.linear_model import LinearRegression

# Create sample data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10])

# Regression
regression = LinearRegression()
regression.fit(X, y)

# Predict
X_new = np.array([6, 7, 8]).reshape(-1, 1)
predictions = regression.predict(X_new)

print(predictions)

### Using Clustering to handle Noisy Data

In [None]:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Create sample data
X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

# Clustering
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)

# Predict
new_data = np.array([[0, 0], [4, 4]])
predictions = kmeans.predict(new_data)

print(predictions)

# Functions for Data Visualization
- The human brain is very good at identifying patterns.
- Visualizing your dataset during the EDA process and identifying the patterns can be very beneficial.
    - Histograms make analyzing the distribution of the data an easier task;
    - Boxplot is great for identifying outliers; 
    - Scatter plot is very useful when it comes to checking the correlations between two variables.

### Creating group plots for all features at once

##### Numeric values 
- look at the distributions of columns with numerical values.

In [None]:
def histograms_numeric_columns(df, numerical_columns):
    '''
    Takes df, numerical columns as list
    Returns a group of histograms
    '''
    f = pd.melt(df, value_vars=numerical_columns) 
    g = sns.FacetGrid(f, col='variable',  col_wrap=4, sharex=False, sharey=False)
    # sns.distplot is deprecated; histplot with kde=True draws a comparable plot
    g = g.map(sns.histplot, 'value', kde=True)
    return g

In [None]:
edab.histograms_numeric_columns(df, numerical_columns)

##### Heatmaps

- To check the correlation between your dependent and independent variables.
- heatmaps can be visually cluttered if you have too many features. 
    - One way to avoid it is to create a heatmap just for the dependent variable (target) and independent variables (features). 

In [None]:
def heatmap_numeric_w_dependent_variable(df, dependent_variable):
    '''
    Takes df, a dependent variable as str
    Returns a heatmap of all independent variables' correlations with dependent variable 
    '''
    plt.figure(figsize=(8, 10))
    # numeric_only=True skips non-numeric columns, which recent pandas versions no longer drop silently
    g = sns.heatmap(df.corr(numeric_only=True)[[dependent_variable]].sort_values(by=dependent_variable), 
                    annot=True, 
                    cmap='coolwarm', 
                    vmin=-1,
                    vmax=1) 
    return g

In [None]:
edab.heatmap_numeric_w_dependent_variable(df, dependent_variable)

# Changing Data Types

- It is good practice to check the data type of every column with the pandas dtypes attribute.
- Be aware of the memory usage of the different data types; choosing the right type speeds up processing.
- Reading the data dictionary is very illuminating during this step.

### Transform categorical features into numerical (ordinal) features

##### transformer
- A function that maps each str in a list to an int, where the int is the index of that element in the list.

In [None]:
def categorical_to_ordinal_transformer(categories):
    '''
    Returns a function that will map categories to ordinal values based on the
    order of the list of `categories` given. Ex.

    If categories is ['A', 'B', 'C'] then the transformer will map 
    'A' -> 0, 'B' -> 1, 'C' -> 2.
    '''
    return lambda categorical_value: categories.index(categorical_value)

##### Second function has two parts:

- The first part takes a dictionary of the following form

In [None]:
categorical_numerical_mapping = {
    'Utilities': ['ELO', 'NoSeWa', 'NoSewr', 'AllPub'],
    'Exter Qual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'Exter Cond': ['Po', 'Fa', 'TA', 'Gd', 'Ex']
}

- Using the previous function, it turns the dictionary into this

In [None]:
transformers = {'Utilities': <utilities_transformer>,
                'Exter Qual': <exter_qual_transformer>,
                'Exter Cond': <exter_cond_transformer>}

- second part of the function uses the .map() method to map each transformer function onto the dataframe. 
    - Note that a copy of the original dataframe will be created

In [None]:
def transform_categorical_to_numercial(df, categorical_numerical_mapping):
    '''
    Transforms categorical columns to numerical columns
    Takes a df, a dictionary 
    Returns df
    '''
    transformers = {k: categorical_to_ordinal_transformer(v) 
                    for k, v in categorical_numerical_mapping.items()}
    new_df = df.copy()
    for col, transformer in transformers.items():
        new_df[col] = new_df[col].map(transformer).astype('int64')
    return new_df

### Convert categorical variable to numerical variable 2.0

- Machine learning models require variables to be in numerical format. 
- This is when we need to convert categorical variables to numerical variables before feeding them to the models.

In [None]:
def convert_cat2num(df):
    # Convert categorical variable to numerical variable
    num_encode = {'col_1' : {'YES':1, 'NO':0},
                  'col_2'  : {'WON':1, 'LOSE':0, 'DRAW':0}}  
    df = df.replace(num_encode)
    return df

### Pipeline will follow the strategy

##### Categorical encoding

- To perform computations with categorical data, in most cases the data needs to be of a numeric type, i.e. numbers or integers. 
- Common techniques consist of 
    - one-hot encoding data, or 
        - one-hot encoding represents each unique value of a feature as a binary vector
    - label encoding data.
        - label encoding assigns a unique integer to each value.
        
- There are pros and cons for each of the methods: 
    - one-hot encoding
        - produces a lot of additional features
    - label encoding
        - the labels might be interpreted by certain algorithms as mathematically dependent: 
            - 1 apple + 1 orange = 1 banana, 
        - which is obviously a wrong interpretation of this type of categorical data.
        
- Set the default strategy 'auto' to perform the encoding according to the following rules:
    - if the feature contains at most 10 unique values, it will be one-hot-encoded
    - if the feature contains 11 to 20 unique values, it will be label-encoded
    - if the feature contains more than 20 unique values, it will not be encoded
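
The 'auto' rule can be sketched with plain pandas, using cut-offs of 10 and 20 unique values. The function `encode_auto` is a hypothetical stand-in: it uses `pd.get_dummies` and category codes in place of the `_to_onehot` and `_to_label` helpers, whose bodies are not shown in these notes, and the toy data is made up.

```python
import pandas as pd

def encode_auto(df, feature):
    """Hypothetical sketch of the 'auto' encoding rule for one column."""
    n_unique = df[feature].nunique()
    if n_unique <= 10:
        # one-hot: one binary column per unique value
        dummies = pd.get_dummies(df[feature], prefix=feature)
        return pd.concat([df.drop(columns=feature), dummies], axis=1)
    if n_unique <= 20:
        # label: one integer per unique value
        out = df.copy()
        out[feature] = out[feature].astype("category").cat.codes
        return out
    return df  # more than 20 unique values: leave the column untouched

toy = pd.DataFrame({
    "fruit": ["apple", "orange", "banana"] * 4,       # 3 unique values
    "city": [f"city_{i:02d}" for i in range(12)],     # 12 unique values
})
onehot = encode_auto(toy, "fruit")
label = encode_auto(toy, "city")
print(onehot.columns.tolist())
print(label["city"].tolist())
```

The fruit column expands into three binary columns, while the city column keeps a single column of integer codes, which illustrates the trade-off between feature count and implied ordering.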

In [None]:
class EncodeCateg:
    # Function for encoding of categorical features
    # to specify columns set encode_categ to: ['auto', ['col1', 'col2']]
    def handle(df, encode_categ=['auto']):
        if encode_categ[0]:
            # select non numeric features
            cols_categ = set(df.columns) ^ set(df.select_dtypes(include=np.number).columns)
            # check if all columns should be encoded
            if len(encode_categ) == 1:
                target_cols = cols_categ # encode ALL columns
            else:
                target_cols = encode_categ[1] # encode only specific columns
            for feature in target_cols:
                if feature in cols_categ:
                    feature = feature # columns are column names
                else:
                    feature = df.columns[feature] # columns are indexes
                try:
                    # skip encoding of datetime features
                    pd.to_datetime(df[feature])
                except:
                    try:
                        if encode_categ[0] == 'auto':
                            # ONEHOT encode if not more than 10 unique values to encode
                            if df[feature].nunique() <=10:
                                df = EncodeCateg._to_onehot(df, feature)
                            # LABEL encode if not more than 20 unique values to encode
                            elif df[feature].nunique() <=20:
                                df = EncodeCateg._to_label(df, feature)
                            # skip encoding if more than 20 unique values to encode
                        elif encode_categ[0] == 'onehot':
                            df = EncodeCateg._to_onehot(df, feature)
                        elif encode_categ[0] == 'label':
                            df = EncodeCateg._to_label(df, feature)
                    except:
                        pass
        return df
    def _to_onehot(df, feature, limit=10):
        ...
    def _to_label(df, feature):
        ...

- The handle function takes a list as input; the features we want to encode manually can be given by column name or by index, as follows:

In [None]:
encode_categ = ['onehot', ['column_name', 2]]

##### Data conversion: 

The astype() function in pandas can be used to convert columns to a specific data type.

In [None]:
import pandas as pd

# Create a DataFrame with columns of different data types
data = {'col1': [1, 2, 3],
        'col2': ['A', 'B', 'C'],
        'col3': [True, False, True]}
df = pd.DataFrame(data)

# Convert col1 to float data type
df['col1'] = df['col1'].astype(float)

# Print the updated DataFrame
print(df.dtypes)

# Change data type to reduce memory

- Changing data type is common if you want to reduce memory usage.
    - Use the astype('dtype') function, where you specify the dtype you want.
- Here the data type of the host_id column is changed from int64 to int32.
- Compare the memory usage reported by info() before and after changing the data type.
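
The saving can be measured directly, as in the sketch below. The dataframe is a made-up stand-in for the real host_id column, and the range check is an added precaution that is not part of the original notes.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the host_id column
df = pd.DataFrame({"host_id": np.arange(100_000, dtype="int64")})

before = df["host_id"].memory_usage(index=False)  # bytes used by the values

# int32 is only safe when every value fits into the int32 range
int32 = np.iinfo("int32")
if df["host_id"].between(int32.min, int32.max).all():
    df["host_id"] = df["host_id"].astype("int32")

after = df["host_id"].memory_usage(index=False)
print(before, after)
```

Halving the width of the integer type halves the memory of the column; values outside the int32 range would overflow, which is why the check comes first.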

In [None]:
airbnb['host_id'] = airbnb['host_id'].astype('int32')
airbnb.info()

### Change dtypes

- When a dataset gets larger, we need to convert the dtypes in order to save memory.

In [None]:
def change_dtypes(col_int, col_float, df): 
    '''
    AIM    -> Changing dtypes to save memory
     
    INPUT  -> List of column names (int, float), df
    
    OUTPUT -> updated df with smaller memory  
    ------
    '''
    df[col_int] = df[col_int].astype('int32')
    df[col_float] = df[col_float].astype('float32')
    df["Date"] = df["Date"].astype("datetime64[ns]")
    df["Payment"] = df["Payment"].str[1:].str.replace(",", ".").astype("float")
    return df

### Converting String column to numeric Column

In [None]:
# Select all numeric attribute columns, i.e. excluding "word-type" columns such as Nationality.
cols = ['Overall', 'Acceleration', 'Aggression',
       'Agility', 'Balance', 'Ball control', 'Composure', 'Crossing', 'Curve',
       'Dribbling', 'Finishing', 'Free kick accuracy', 'GK diving',
       'GK handling', 'GK kicking', 'GK positioning', 'GK reflexes',
       'Heading accuracy', 'Interceptions', 'Jumping', 'Long passing',
       'Long shots', 'Marking', 'Penalties', 'Positioning', 'Reactions',
       'Short passing', 'Shot power', 'Sliding tackle', 'Sprint speed',
       'Stamina', 'Standing tackle', 'Strength', 'Vision', 'Volleys']

In [None]:
# Create function.
def to_float(x):    
    "Transforms attribute columns to type float"
    
    if type(x) is int:
        return float(x)
    else:
        # otherwise x is assumed to be a string; only its first two characters are converted
        return float(x[0:2])

In [None]:
# Use applymap() function to transform all selected columns.
# (from pandas 2.1 on, DataFrame.map is the preferred name for applymap)
df[cols] = df[cols].applymap(to_float)

### Changing int column 2.0

In [None]:
floatage = lambda x: float(x)

# .loc avoids chained indexing; the result is a new Series, df itself is unchanged
df.loc[df["Age"] > 36, "Age"].apply(floatage)

### Convert Percentage String to Numeric

- The column is displayed as percentages and treated as strings. 
- This is fine when presenting the table as a report, but we cannot perform any meaningful mathematical operations or analysis on the values, because they are not numeric.
- **Solution**
    - 1st use the pandas.Series.str.rstrip() method to remove the trailing '%' character and then 
    - 2nd use astype(float) to convert it to numeric. 
    - use Series.str.lstrip() 
        - to remove leading characters in series and 
    - use Series.str.strip() 
        - to remove both leading and trailing characters in series.

In [None]:
df[column] = df[column].str.rstrip("%").astype(float)/100

### Change the decimal places, say to 2 decimal points

In [None]:
pd.options.display.float_format = '{:,.2f}'.format

### Convert Numeric to Percentage String
- How to do this the other way round, i.e. convert the numeric values back to percentage strings? 
- To convert back to a percentage string
    - use Python's string format syntax '{:.2%}'.format to add the '%' sign back. 
    - Then use the pandas map() method to apply the formatting to every row in the specific column.
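
A round trip in both directions can be sketched on a tiny made-up column (the column names and values are assumptions for illustration).

```python
import pandas as pd

df = pd.DataFrame({"rate": ["45.5%", "7%", "100%"]})  # made-up values

# string -> number: drop the trailing '%' and rescale to a fraction
df["rate_num"] = df["rate"].str.rstrip("%").astype(float) / 100
# number -> string: '{:.2%}' multiplies by 100 and appends '%'
df["rate_str"] = df["rate_num"].map("{:.2%}".format)
print(df)
```

Note that the formatted strings always carry two decimals, so "7%" comes back as "7.00%"; the numeric column is the one to keep for calculations.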

In [None]:
df.loc[:, column] = df[column].map('{:.2%}'.format)

# Data normalization: 

The minmax_scale() function in scikit-learn (sklearn.preprocessing) can be used to normalize a dataset so that all values are between 0 and 1.

In [None]:
import numpy as np
from sklearn.preprocessing import minmax_scale

# Create a NumPy array with values
data = np.array([10, 20, 30, 40, 50])

# Normalize the data between 0 and 1
normalized_data = minmax_scale(data)

# Print the normalized data
print(normalized_data)

# String Manipulation

### Remove strings in columns

- You may face newline characters or other odd symbols in your columns of strings. 
- These can easily be dealt with using df['col_1'].replace, where col_1 is one of the columns in the dataframe df.

In [None]:
# Syntax

string.replace("old", "new", count)

# where count is the number of occurrences you want to replace; if not specified, all occurrences are replaced

In [1]:
string = "I have to continue pushing and strive to be the best"

In [2]:
string.replace("pushing", "working")

'I have to continue working and strive to be the best'

In [None]:
def remove_col_str(df):
    # remove a portion of string in a dataframe column - col_1
    df['col_1'] = df['col_1'].replace('\n', '', regex=True)
    
    # remove all the characters after &# (including &#) for column - col_1
    df['col_1'] = df['col_1'].replace(' &#.*', '', regex=True)
    return df

### Remove white space in columns

In [None]:
# Syntax

string.strip("chars")

# strips any of the given characters from both ends (whitespace if omitted)
# There are also the following
# lstrip() - Remove at the beginning
# rstrip() - Remove at the end

In [9]:
string2 = "  n,s    I have to complete these and moor Python "

In [10]:
string2.strip(" s,n")

'I have to complete these and moor Pytho'

In [None]:
def remove_col_white_space(df,col):
    # remove white space at the beginning of string 
    df[col] = df[col].str.lstrip()

### Creating two cols from a Single column

##### Case
- Names can be represented with all upper case or all lower case letters. 
- Another option is to capitalize them (i.e. only the first letter is upper case).

##### Order
- The other issue is the order of the last and first names. First name followed by last name is the more standard representation.
    - first split the values at the comma
    - then take the second column (i.e. 1) and combine it with the first column (i.e. 0) with a space in between
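
Both fixes, case and order, can be sketched together on a made-up name column (the column names and sample values are assumptions).

```python
import pandas as pd

df = pd.DataFrame({"name": ["DOE, john", "smith, ANNA"]})  # made-up names

# split once at the comma: column 0 is the last name, column 1 the first name
parts = df["name"].str.split(",", n=1, expand=True)
last = parts[0].str.strip()
first = parts[1].str.strip()
# first name, a space, then last name; title() capitalises each word
df["full_name"] = (first + " " + last).str.title()
print(df["full_name"].tolist())
```

Stripping each part before joining avoids the stray space that follows the comma in the raw values.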

### Splitting string with an identified Separator

In [None]:
# Syntax

string.split("Separator", maxsplit = i)

# where the sep is the string where the split will occur
# where maxsplit indicates how many splits to do

In [14]:
text = "I have so much to do, its not even funny"

In [13]:
text.split()

['I', 'have', 'so', 'much', 'to', 'do,', 'its', 'not', 'even', 'funny']

In [15]:
text.split(",")

['I have so much to do', ' its not even funny']

### Concatenate two columns with strings (with condition)

- When you want to combine two columns with strings conditionally. 
- For instance, you want to concatenate the 1st column with the 2nd column if the strings in the 1st column end with certain letters. 
- The ending letters can also be removed after the concatenation, depending on your needs.

In [None]:
def break_it_apart(df, col, sep):
    # split once and reuse the pieces: part 0 is the last name, part 1 the first name
    parts = df[col].str.split(sep, expand=True)
    df["Name"] = (parts[1].str.strip() + " " + parts[0].str.strip()).str.lower()
    return df

In [None]:
def concat_col_str_condition(df):
    # concat 2 columns with strings if the last 3 letters of the first column are 'pil'
    mask = df['col_1'].str.endswith('pil', na=False)
    col_new = df.loc[mask, 'col_1'] + df.loc[mask, 'col_2']
    # replace the 'pil' with a space
    col_new = col_new.str.replace('pil', ' ', regex=False)
    return col_new


### Joining/ Concatenating Strings in a list

In [None]:
# Syntax

string.join(seq)

# where the string is the char that the sequence will join based on
# where sequence is the sequence of string elements to be joined 

In [17]:
ls = ["Sip", "is", "going", "to", "be", "rich"]

In [18]:
" ".join(ls)

'Sip is going to be rich'

### Must Know String methods

- **capitalize()** : Converts the first character to upper case (and the rest to lower case)
- **casefold()/lower()** : Converts a string to lower case
- **count()** : Counts the number of times a value occurs in a string
- **endswith()** : Returns True if the string ends with a specific value
- **startswith()** : Returns True if the string starts with a specific value
- **index()** : Returns the position of the first occurrence of a value in a string
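
The methods in the list can be tried on one sample sentence; the sentence is made up, and each result is stored so it can be inspected.

```python
text = "data cleaning is half the data work"  # made-up sample sentence

capitalized = text.capitalize()       # first character upper case
folded = "DATA".casefold()            # lower case
occurrences = text.count("data")      # how often "data" appears
ends = text.endswith("work")          # does it end with "work"?
starts = text.startswith("data")      # does it start with "data"?
position = text.index("cleaning")     # index of the first match

print(capitalized, folded, occurrences, ends, starts, position)
```

Unlike find(), index() raises a ValueError when the value is not present, so it is best used when a match is expected.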

### Output a dataframe based on unique last strings in a column

- Reusing the dataframe with all customers' information, add a timestamp column for each id.
- Now each id is no longer unique; it is repeated throughout its respective time period. 
    - For each id you want the last row, because you only care about the customer's final information.
    - Drop all duplicated ids and keep just the last row. 
        - This makes sure that we always get the final customers' information with a unique id
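
A small made-up log illustrates the idea. The sketch uses `drop_duplicates` with a `subset` after sorting by timestamp, which is an alternative to the index-based filtering in the function that follows; the column names and values are assumptions.

```python
import pandas as pd

# Made-up customer log: ids repeat, each row carries a timestamp
log = pd.DataFrame({
    "id": [1, 2, 1, 3, 2],
    "timestamp": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-01-05", "2023-01-03", "2023-01-09"]),
    "plan": ["basic", "basic", "premium", "basic", "premium"],
})

# sort chronologically so that "last" really means "most recent"
latest = (log.sort_values("timestamp")
             .drop_duplicates(subset="id", keep="last")
             .reset_index(drop=True))
print(latest)
```

Sorting first matters whenever the rows are not already in time order; otherwise the last row per id is simply the last one in the file.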

In [None]:
def get_unique_last_str(df):
    '''
    AIM    -> get unique last str for a column
     
    INPUT  -> df
    
    OUTPUT -> updated df based on the unique last strings in a column 
    ------
    '''
    df = df[df.index.isin(df['col_1'].drop_duplicates(keep='last').index)].reset_index(drop=True)
    return df

##### Replace all values that do not contain a specific string in pandas dataframe

In [None]:
df['col'] = np.where(~df['col'].str.contains('land', na = False), 'other', df['col'])

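In [None]:
# Same result written the other way round: keep matches, replace everything else.
# Comparing with True turns the NaN that str.contains returns for missing values into False
df['col'] = np.where(df['col'].str.contains('land') == True, df['col'], 'other')

In [None]:
# na is not handled here: on object columns a missing value yields NaN in the mask,
# so pass na=False as in the first variant to be safe
df['col'] = np.where(df['col'].str.contains('land'), df['col'], 'other')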

# Inconsistent data/Irrelevant features

- Inconsistent data means that the unique classes of a column have different representations.
- Inconsistent data refers to things like 
    - spelling errors in your data, 
    - column names that are not relevant to the data, 
    - the wrong data type
    
- There is no general automation for this task, hence we need to analyze the classes manually. 
    - The pandas unique function serves this purpose.

In [None]:
df['CarName'] = df['CarName'].str.split().str[0]
print(df['CarName'].unique())

- Use the pandas loc function to fix the misspelled classes

In [None]:
df.loc[df['CarName'] == 'maxda', 'CarName'] = 'mazda'
df.loc[df['CarName'] == 'Nissan', 'CarName'] = 'nissan'
df.loc[df['CarName'] == 'porcshce', 'CarName'] = 'porsche'
df.loc[df['CarName'] == 'toyouta', 'CarName'] = 'toyota'
df.loc[df['CarName'] == 'vokswagen', 'CarName'] = 'volkswagen'
df.loc[df['CarName'] == 'vw', 'CarName'] = 'volkswagen'

# Regex

- Sometimes a column contains characters that need to be removed,
- for example non-alphanumerical characters (e.g. ?, !, -, ., and so on).
- The replace function can be used in this case too, because it accepts a regular expression (i.e. regex).
- Use the first function below if we only want alphabetical characters.

In [None]:
def remove_non_alpha_char(df, col):
    # remove every non-alphabetical character (digits, ?, !, -, ., and so on)
    # so that only alphabetical characters remain
    return df[col].str.replace('[^a-zA-Z]', '', regex=True).str.lower()

In [None]:
def remove_non_alphanumerical_char(df, col):
    # to keep both alphabetical and numeric (i.e. alphanumerical), we need to add the numbers in our regex
    return df[col].str.replace('[^a-zA-Z0-9]', '', regex=True).str.lower()

### Finding Correctly Formatted strings stored in a specific way and contains specific characters

##### Use Regular Expression (regex)
- Helps with searching for common string patterns
- Provided by the re package
- Includes a lot of functions to deal with regular expressions

##### metacharacters
- Control characters of the regex engine that are interpreted in a special way.
    - ^ : Matches only at the beginning of the string
    - period sign (.) : Matches any single char except a new line
    - asterisk sign (*) : Matches 0 or more occurrences of the preceding pattern
    - [...] : Matches any single char in the brackets
    - [^...] : Matches any single char not in the brackets
    - {...} : Matches exactly the given number of occurrences
    - (...) : Used to group sub-patterns
    - plus sign (+) : Matches one or more occurrences of the preceding pattern
    - | : Matches either of the alternatives
    - \ : Escapes special characters. Put a backslash before a char if you are unsure whether it has a special meaning.
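
Each metacharacter can be tried with `re.findall`, as in the sketch below; the patterns and sample strings are made up, one pair per entry in the list.

```python
import re

# (pattern, sample text) pairs, one per metacharacter; all samples are made up
cases = [
    (r"^Sip", "Sip is 28"),            # ^     anchors the match at the start
    (r"b.t", "bat bit b\nt but"),      # .     any char except a new line
    (r"ab*", "a ab abbb"),             # *     zero or more of the preceding 'b'
    (r"[aeiou]", "regex"),             # [...] any one char from the set
    (r"[^0-9]", "a1b2"),               # [^..] any one char not in the set
    (r"\d{3}", "12 345 6789"),         # {n}   exactly n repetitions
    (r"(\d+)-(\d+)", "10-20 30-40"),   # (...) groups are returned as tuples
    (r"\d+", "Sip is 28"),             # +     one or more of the preceding
    (r"cat|dog", "cat, dog, bird"),    # |     either alternative
    (r"\.", "a.b.c"),                  # \     escapes the special meaning of '.'
]

results = [re.findall(pattern, text) for pattern, text in cases]
for (pattern, text), found in zip(cases, results):
    print(f"{pattern!r} on {text!r} -> {found}")
```

Writing patterns as raw strings (the `r` prefix) keeps backslashes intact, so `\d` and `\.` reach the regex engine unchanged.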

In [None]:
# Syntax

re.search()

# Scans the string for the pattern and returns a match object for the first location where it matches (None if there is no match)

In [19]:
import re

In [20]:
string3 = "Siphamandla mandindi"

In [21]:
re.search("man", string3)

<re.Match object; span=(5, 8), match='man'>

In [23]:
string4 = "Superman3539"

In [24]:
re.search("[0-9][0-9][0-9]", string4)

<re.Match object; span=(8, 11), match='353'>

In [None]:
# Syntax

re.findall()

# returns a list containing all matches

# note
# \d+ matches one or more digits

In [25]:
string5 = "Sip is 28, and Ayanda is 27"
pattern = r"\d+"
res = re.findall(pattern, string5)

In [26]:
res

['28', '27']

In [None]:
# Syntax

re.split()

# Divides the string into pieces where the pattern occurs and returns a list of the pieces

In [27]:
string5 = "Sip is 28, and Ayanda is 27"
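# ^ anchors the pattern to the start of the string; the text does not start with "is", so nothing is split
pattern = "^is"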
res = re.split(pattern, string5)

In [28]:
res

['Sip is 28, and Ayanda is 27']

In [None]:
# Syntax

re.sub()

# returns a string where the matches are replaced with the specified new value

In [30]:
string6 = "today is a great day"
pattern = "^to"
replace = "Wednes"
result = re.sub(pattern, replace, string6)
result

'Wednesday is a great day'

### Data cleaning using regular expressions: 

Python's re module can be used to perform data cleaning operations using regular expressions.

In [None]:
import re

# Define a string with dirty data
data = "abc123def456ghi"

# Remove all numbers from the string
clean_data = re.sub(r'\d+', '', data)

# Print the cleaned data
print(clean_data)

### Remove rows based on regex

- Choose a word as the target and use the str.contains() function to find the rows that contain it.
- Using the drop function with axis set to index, supply the indexes found and drop those rows.
- Printing out the number of rows, you can see that it is reduced by three.

In [None]:
# example: remove rows that contain the target word
target = '[Nn]oisy'

# na=False treats missing names as "no match" so the boolean mask contains no NaN
noisy_airbnb = airbnb[airbnb['name'].str.contains(target, regex=True, na=False)]

# show rows that contain the word noisy
print(noisy_airbnb['name'])

# get the index of the rows that contain the word noisy
index_to_drop = noisy_airbnb.index

# print(index_to_drop)

In [None]:
# drop rows based on index
airbnb.drop(index_to_drop, axis='index', inplace=True)

# compare row counts before and after (assumes airbnb_ori holds a copy made before the drop)
print(len(airbnb_ori))
print(len(airbnb))
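
The drop can also be written as a boolean filter that leaves the original frame untouched. This is a small self-contained sketch; `listings` is a hypothetical stand-in for the airbnb data.

```python
import pandas as pd

# Hypothetical stand-in for the airbnb listings
listings = pd.DataFrame({
    "name": ["Quiet loft", "Noisy street studio", "Sunny room", None, "cozy but noisy flat"],
    "price": [120, 80, 95, 60, 70],
})

# na=False treats missing names as "no match" instead of propagating NaN into the mask
is_noisy = listings["name"].str.contains(r"[Nn]oisy", regex=True, na=False)

# Keep everything that does not match; listings itself is not modified
quiet_listings = listings[~is_noisy]

print(len(listings), len(quiet_listings))
```

Filtering with `~mask` avoids `inplace=True`, which makes it easier to compare the data before and after.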

# Syntax errors

### Fixing typos
- Map different spellings or abbreviations that mean the same thing onto one canonical value.

In [None]:
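# map() rewrites the listed values; anything not in the dictionary becomes NaN
dataframe['gender'].map({'m': 'male', 'fem.': 'female'})
# regex alternative: "^m$" matches a value that is exactly "m", in any case
re.sub(r"^m$", 'Male', 'm', flags=re.IGNORECASE)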

### Spelling errors in categorical data

- Categorical data may contain spelling errors or inconsistent capitalization that split one category into several.
- Fix this with the replace function in pandas. 
- Pass the wrong values first, then the correct ones.

In [None]:
airbnb['neighbourhood_group'].value_counts()

In [None]:
wrong_spelling = ['manhatann', 'brookln']

# deliberately overwrite a few rows with the misspellings so there is something to fix
airbnb.loc[random_index,'neighbourhood_group'] = wrong_spelling
airbnb['neighbourhood_group'].value_counts()

In [None]:
# assign the result back instead of using inplace=True on a selected column
airbnb['neighbourhood_group'] = airbnb['neighbourhood_group'].replace(
    ['manhatann', 'brookln'], ['Manhattan', 'Brooklyn'])
airbnb['neighbourhood_group'].value_counts()
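
When capitalization varies as well, normalizing the case first keeps the correction table short. This is a sketch on a hypothetical Series `groups`; the correction dictionary is an assumption for illustration.

```python
import pandas as pd

# Hypothetical column with typos and inconsistent capitalization
groups = pd.Series(["Manhattan", "manhatann", "Brooklyn", "brookln", "BROOKLYN"])

# Normalize case first so one correction entry covers every capitalization
corrections = {"manhatann": "manhattan", "brookln": "brooklyn"}
cleaned = groups.str.lower().replace(corrections).str.title()

print(cleaned.value_counts().to_dict())
```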

# Data Transformation

### Invalid Data

- Values that are simply not logically possible.
- Reasons leading to invalid data:
    - Data collection errors
        - Someone entering data can type 1799 instead of 179 in a height column. 
        - Random mistakes like this can be treated as null values and imputed alongside other NAs.

    - Data manipulation errors
        - Some columns of a dataset are the output of functions written by developers. 
        - For example, if a function that calculates age from birthdate returns negative values, the calculation is incorrect.

In [None]:
# Create function.
def position_type(s):
    
    """Classify a string of position abbreviations as forward, midfielder,
    back or goalkeeper, based on the second-to-last character of the string."""
    
    if s[-2] in ('T', 'W'):
        return 'Forward'
    elif s[-2] == 'M':
        return 'Midfielder'
    elif s[-2] == 'B':
        return 'Back'
    else:
        return 'GoalKeeper'

In [None]:
# Create position type column.
df['Preferred Positions Type'] = df['Preferred Positions'].apply(position_type)

# Look at first 5 entries.
df['Preferred Positions Type'].head()

### Convert categorical labels to numeric

- Identify the range of values that a feature may contain. 
- Based on the values identified, create a function that replaces each label with a numerical value.

The function takes a single cell value as input and returns the matching number. 
- We can then apply the function to the series of interest with apply().

In [None]:
df['col'].value_counts()


def garage_qual_cleaner(cell):
    if cell == 'Ex':
        return 5
    elif cell == 'Gd':
        return 4
    elif cell == 'TA':
        return 3
    elif cell == 'Fa':
        return 2
    elif cell == 'Po':
        return 1
    else:
        return 0

Alternatively, map a dictionary to overwrite values without creating a function.

In [None]:
train['kitchen_qual'].map({'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1})
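
One difference between the two approaches is that map() returns NaN for any label missing from the dictionary, while the function falls back to 0. The sketch below, on a hypothetical Series `quality`, shows one way to reproduce that fallback with map().

```python
import pandas as pd

# Hypothetical quality ratings, including a missing value and an unexpected label
quality = pd.Series(["Ex", "Gd", "TA", None, "Fa", "??"])

ordinal_scale = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}

# map() yields NaN for anything not in the dictionary; fill those with 0
# to mirror the else-branch of the cleaner function
quality_num = quality.map(ordinal_scale).fillna(0).astype(int)

print(quality_num.tolist())
```

Whether 0 is a sensible default depends on the feature; for some columns a missing rating is better left as NaN and imputed later.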

### Create a Master function

In [None]:
def data_cleaner(df):
    # dictionary that maps the quality labels onto numeric values
    qual_dict = {'Po':0,'Fa':1,'TA':2,'Gd':3,'Ex':4}
    # list the ordinal columns: names ending in 'qual' or 'cond';
    # the last condition ignores the "overall" quality columns, which are addressed separately
    ordinal_col_names = [col for col in df.columns if (col[-4:] in ['qual', 'cond']) and col[:3] != 'ove']
    # create a new feature called age: years between the sale and the latest build/remodel
    df['age'] = df.apply(lambda row: row['yr_sold'] - max(row['year_built'], row['year_remod/add']), axis=1)
    # combine month and year sold into a single 'month-year' string column
    df['date_sold'] = df.apply(lambda row: str(row['mo_sold'])+ '-' + str(row['yr_sold']), axis=1)
    # fill missing values in the non-object columns with the column mean
    df.loc[:,df.dtypes!= 'object'] = df.loc[:, df.dtypes != 'object'].apply(lambda col: col.fillna(col.mean()))
    
    # transform the ordinal columns; missing values get the middle rating (2)
    # (DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
    df[ordinal_col_names] = df[ordinal_col_names].applymap(lambda cell: 2 if pd.isnull(cell) else qual_dict[cell])
    
    return df

# applying the function to train data
train = data_cleaner(train)

### Data transformation: 

The apply() method of a pandas Series applies a custom function to each element of a column; DataFrame.apply() applies a function along rows or columns.

In [None]:
import pandas as pd

# Create a DataFrame with a column
data = {'col1': [1, 2, 3, 4, 5]}

df = pd.DataFrame(data)

# Square each value in col1 using a custom function
df['col1_squared'] = df['col1'].apply(lambda x: x**2)

# Print the transformed DataFrame
print(df)

### Columnwise Transformation

In [None]:
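import ast

def col_wise_trans(df, col_name):
    """Expand a column of stringified records (written with '=>') into a flat DataFrame."""
    frames = []
    for value in df[col_name]:
        if value != "[]":
            # turn 'key=>value' pairs into Python dict syntax and strip line breaks
            cleaned = value.replace('=>', ':').replace('\n', '').replace('\\r', '')
            frames.append(pd.json_normalize(ast.literal_eval(cleaned)))
    # DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

In [None]:
col_wise_trans(df, col_name)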

### What about variables without a clear, ordered relationship? 

- Visualize the relationship with our target:

In [None]:
plt.figure(figsize=(35,10)) # adjust the fig size to see everything
# recent seaborn versions require x and y to be passed as keyword arguments
sns.boxplot(x=train['neighborhood'], y=train['saleprice']).set_title('Sale Price varies widely by Ames Neighborhood')

- Variables that do not take on ordinal values can be converted to numbers by dummifying. 
- Pandas provides a method to dummify variables: 
    - for each value (in this case each neighborhood) a new feature is created, and every row has a value of 0 or 1 in that column. 
        - A 1 signifies that the row contained, in the original string column, the value that is now in the column name.
- Use the pandas get_dummies method to convert these to numeric values. 
- It is important to pass the ‘drop_first’ argument and set it to ‘True’. 
- This dummifies all levels except the first one. A categorical variable with K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables, and the dropped level serves as the reference category. 
- If a row has a value of 0 for all dummy columns, we know that the observation belonged to the dropped level.



In [None]:
pd.get_dummies(df, columns= ['col'], drop_first = True)
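
To make the K-1 rule concrete, here is a small sketch with a three-level column. The frame `homes` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data with a three-level categorical column
homes = pd.DataFrame({
    "neighborhood": ["North", "South", "West", "North"],
    "saleprice": [200000, 185000, 230000, 210000],
})

dummies = pd.get_dummies(homes, columns=["neighborhood"], drop_first=True)

# K = 3 levels produce K - 1 = 2 dummy columns; "North" (first in sorted order)
# is dropped and becomes the reference category
print(dummies.columns.tolist())
```

Rows that belonged to "North" have 0 in both remaining dummy columns.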

# Timestamps

### Convert timestamp(from string to datetime format)

In [None]:
def convert_str_datetime(df): 
    '''
    AIM    -> Convert datetime(String) to datetime(format we want)
     
    INPUT  -> df
    
    OUTPUT -> updated df with new datetime format 
    ------
    '''
    df.insert(loc=2, column='timestamp', value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))
    return df

### Output a dataframe without NaN values for a Timestamp column

- Output a dataframe with all the rows for which a given column has data.
    - Exclude rows where that column is NaN.
    - Typical for time series data, where the column is the timestamp.

In [None]:
def remove_nan_values(df):
    '''
    AIM    -> remove NaN values of a particular column and output the whole dataframe
     
    INPUT  -> df
    
    OUTPUT -> updated df without NaN values for a particular column 
    ------
    '''
    df = df[df['col_1'].notnull()]
    return df
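
The two timestamp helpers can be combined: parse the strings, let unparseable values become NaT, then drop those rows. This is a sketch under assumptions; the frame `raw` and its values are hypothetical, and only the `transdate` column name and format string come from the function above.

```python
import pandas as pd

# Hypothetical raw data: one unparseable timestamp and one missing value
raw = pd.DataFrame({
    "transdate": ["2021-03-01 10:15:00.000", "not a date", None, "2021-03-02 08:00:00.500"],
    "amount": [10.0, 12.5, 7.0, 3.2],
})

# errors="coerce" turns unparseable strings into NaT instead of raising
raw["timestamp"] = pd.to_datetime(raw["transdate"], format="%Y-%m-%d %H:%M:%S.%f", errors="coerce")

# Keep only rows with a usable timestamp
clean = raw.dropna(subset=["timestamp"])

print(clean[["timestamp", "amount"]])
```

Coercing silently hides bad input, so it is worth counting the NaT values before dropping them.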

### Pipeline will follow the strategy

##### Extraction of DateTime features

- Datetime values, such as 
    - timestamps or 
    - dates, 
- are easier to handle in later processing and visualization once their components are extracted into separate columns.
- We can also do this in an automated fashion: 
    - the pipeline searches through the features and checks whether each one can be converted into the datetime type. 
    - If yes, we assume that this feature holds datetime values.
    
- We can define the granularity at which the datetime features are extracted; the default is ‘s’ for seconds.
- After the extraction, the function checks whether the entries for dates and times are valid: 
    - if the extracted columns ‘Day’, ‘Month’ and ‘Year’ all contain only 0’s, all three are deleted. 
    - The same happens for ‘Hour’, ‘Minute’ and ‘Sec’.

In [None]:
# Function for extracting datetime values
def convert_datetime(df, extract_datetime='s'):
    cols = set(df.columns) ^ set(df.select_dtypes(include=np.number).columns) 
    for feature in cols: 
        try:
            # convert features encoded as strings to type datetime ['D','M','Y','h','m','s']
            df[feature] = pd.to_datetime(df[feature], infer_datetime_format=True)
            df['Day'] = pd.to_datetime(df[feature]).dt.day
            if extract_datetime in ['M','Y','h','m','s']:
                df['Month'] = pd.to_datetime(df[feature]).dt.month
                if extract_datetime in ['Y','h','m','s']:
                    df['Year'] = pd.to_datetime(df[feature]).dt.year
                    if extract_datetime in ['h','m','s']:
                        df['Hour'] = pd.to_datetime(df[feature]).dt.hour
                        if extract_datetime in ['m','s']:
                            df['Minute'] = pd.to_datetime(df[feature]).dt.minute
                            if extract_datetime in ['s']:
                                df['Sec'] = pd.to_datetime(df[feature]).dt.second
            try: # check if entries for the extracted dates/times are valid, otherwise drop
                if (df['Hour'] == 0).all() and (df['Minute'] == 0).all() and (df['Sec'] == 0).all():
                    df.drop('Hour', inplace = True, axis =1 )
                    df.drop('Minute', inplace = True, axis =1 )
                    df.drop('Sec', inplace = True, axis =1 )
                elif (df['Day'] == 0).all() and (df['Month'] == 0).all() and (df['Year'] == 0).all():
                    df.drop('Day', inplace = True, axis =1 )
                    df.drop('Month', inplace = True, axis =1 )
                    df.drop('Year', inplace = True, axis =1 )  
            except:
                pass
        except: # feature cannot be converted to datetime
            pass          
    return df
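
The extraction step can be sketched more compactly on a single known datetime column. This is a simplified illustration, not the pipeline's implementation; the frame `events`, the `created` column and the `created_*` output names are hypothetical.

```python
import pandas as pd

events = pd.DataFrame({"created": ["2023-01-15 08:30:00", "2023-02-20 17:45:10"]})
events["created"] = pd.to_datetime(events["created"])

# Map each granularity code to the matching .dt attribute
parts = {"D": "day", "M": "month", "Y": "year", "h": "hour", "m": "minute", "s": "second"}
for code, attr in parts.items():
    events[f"created_{attr}"] = getattr(events["created"].dt, attr)

# Drop extracted columns that carry no information (all zeros)
zero_cols = [c for c in events.columns if c != "created" and (events[c] == 0).all()]
events = events.drop(columns=zero_cols)

print(events.columns.tolist())
```

Prefixing the new columns with the source column name avoids overwriting results when a dataset has more than one datetime feature.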

### List comprehension
- A list comprehension lets you express the logic in a single line of code without an explicit for loop, and 
    - it usually runs somewhat faster than the equivalent for loop with append().
- Use a list comprehension to build a list of values based on a condition, either to add to the existing dataframe or for further analysis.

In [None]:
def list_comprehension(df):
    '''
    AIM    -> IF ELSE for the list comprehension 
    
    INPUT  -> df 
    
    OUTPUT -> List 
    ------
    '''
    # iterate over the values directly so the result does not depend on the index labels
    compute_list = [x if x > 0 else -1 for x in df['col_1']]
    return compute_list
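
For column-wide conditions like this one, a vectorized expression is a common alternative to the comprehension. The sketch below compares the two on a hypothetical frame `df_demo` with a non-default index.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a non-default index
df_demo = pd.DataFrame({"col_1": [3, -2, 0, 7]}, index=[10, 11, 12, 13])

# List comprehension over the values (independent of the index labels)
via_list = [x if x > 0 else -1 for x in df_demo["col_1"]]

# Vectorized equivalent: keep the value where the condition holds, else -1
via_numpy = np.where(df_demo["col_1"] > 0, df_demo["col_1"], -1).tolist()

print(via_list, via_numpy)
```

Both produce the same list; on large columns the vectorized form is typically the faster of the two.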

### Pipeline will follow the strategy

##### Dataframe Polishing

- Some features that were originally of type integer may have been converted to floats by imputation or other processing steps. Before outputting the final dataframe, convert these values back to integers.
- Round all float features to the same number of decimals they had in the original input dataset. This avoids unnecessary trailing 0’s in the decimals and ensures the values are not rounded more coarsely than the originals.

In [None]:
from AutoClean import AutoClean
pipeline = AutoClean(dataset)

pipeline.output
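
Independently of AutoClean, the two polishing rules can be sketched by hand. This is only an illustration: the frame `polished` is hypothetical, and the choice of two decimals stands in for "the precision of the original input", which would have to be measured on the real data.

```python
import pandas as pd

# Hypothetical frame after imputation: "rooms" was integer-valued but is now float
polished = pd.DataFrame({"rooms": [3.0, 2.0, 4.0], "price": [199.990, 250.500, 180.000]})

for col in polished.select_dtypes(include="float").columns:
    # Convert back to integer only when no information would be lost
    if (polished[col] % 1 == 0).all():
        polished[col] = polished[col].astype("int64")
    else:
        # Assumption: the original data had two decimals
        polished[col] = polished[col].round(2)

print(polished.dtypes.to_dict())
```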

# Speed up your Data Cleaning and Preprocessing with klib

### Relationship between Numeric independent variable vs dependent
-  Pandas’ corrwith() method returns the pairwise correlation of each numeric variable with the target; pass numeric_only=True so that non-numeric columns are ignored.

In [None]:
correlations = train.corrwith(train['saleprice'], numeric_only=True).iloc[:-1].to_frame()
correlations['abs'] = correlations[0].abs()
sorted_correlations = correlations.sort_values('abs', ascending=False)[0]
fig, ax = plt.subplots(figsize=(10,20))
sns.heatmap(sorted_correlations.to_frame(), cmap='coolwarm', annot=True, vmin=-1, vmax=1, ax=ax);

### Relationship between Object variables vs dependent
- Ordinal and categorical variables such as ‘exterior condition’ and ‘central air’ would intuitively have a relationship to sale price, so it is worth visualizing.
- Generate a box plot that compares sale price across the values of an ordinal/categorical variable.

In [None]:
sns.boxplot(x=train['central_air'],
            y=train['saleprice']).set_title('Central Air vs. Sale Price')

In [None]:
sns.boxplot(x=train['kitchen_qual'], 
            y=train['saleprice']).set_title('Kitchen Quality vs. Sale Price')

# Data manipulation

### Adding new data

In [2]:
import pandas as pd

# Creating a DataFrame
data = {'Name': ['John', 'Alice', 'Bob', 'Emily'],
        'Age': [25, 30, 28, 35],
        'City': ['New York', 'Paris', 'London', 'Tokyo']}
df = pd.DataFrame(data)

# Adding new data
new_data = {'Name': 'Mark', 'Age': 27, 'City': 'Sydney'}
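# DataFrame.append was removed in pandas 2.0; wrap the new row in a DataFrame and concatenate
df = pd.concat([df, pd.DataFrame([new_data])], ignore_index=True)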

In [3]:
df

Unnamed: 0,Name,Age,City
0,John,25,New York
1,Alice,30,Paris
2,Bob,28,London
3,Emily,35,Tokyo
4,Mark,27,Sydney


### Updating data

In [4]:
# Updating data
df.loc[df['Name'] == 'Alice', 'Age'] = 31

In [5]:
df

Unnamed: 0,Name,Age,City
0,John,25,New York
1,Alice,31,Paris
2,Bob,28,London
3,Emily,35,Tokyo
4,Mark,27,Sydney


### Deleting data

In [9]:
# Deleting data
df = df.drop(index=2)

In [8]:
df

Unnamed: 0,Name,Age,City
0,John,25,New York
1,Alice,31,Paris
3,Emily,35,Tokyo
4,Mark,27,Sydney


### Sorting and filtering data

In [10]:
# Sorting and filtering data
df = df.sort_values('Age', ascending=False)
filtered_df = df[df['Age'] > 28]

In [11]:
df

Unnamed: 0,Name,Age,City
3,Emily,35,Tokyo
1,Alice,31,Paris
4,Mark,27,Sydney
0,John,25,New York


In [12]:
filtered_df

Unnamed: 0,Name,Age,City
3,Emily,35,Tokyo
1,Alice,31,Paris


### Joining and merging data

In [13]:
# Joining and merging data
data2 = {'Name': ['John', 'Alice', 'Bob'],
         'Salary': [5000, 6000, 4500]}
df2 = pd.DataFrame(data2)
merged_df = pd.merge(df, df2, on='Name', how='left')

In [14]:
merged_df

Unnamed: 0,Name,Age,City,Salary
0,Emily,35,Tokyo,
1,Alice,31,Paris,6000.0
2,Mark,27,Sydney,
3,John,25,New York,5000.0


### Transforming data

In [15]:
# Transforming data
df['City'] = df['City'].str.upper()

In [16]:
df

Unnamed: 0,Name,Age,City
3,Emily,35,TOKYO
1,Alice,31,PARIS
4,Mark,27,SYDNEY
0,John,25,NEW YORK


In [29]:
# Transform
df['A_squared'] = df['A'].transform(lambda x: x ** 2)

In [30]:
df

Unnamed: 0,A,B,C,D,Bins,A_squared
0,1.015811,6.0,10,15,Low,1.031872
1,2.016815,7.0,11,16,Medium,4.067541
2,0.014207,8.0,12,17,Low,0.000202
3,4.053917,9.0,13,18,Medium,16.434243
4,4.89107,0.0,14,19,Medium,23.92257


### Missing Data

In [17]:
import pandas as pd
import numpy as np

# Create sample DataFrame
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [6, 7, 8, 9, np.nan],
    'C': [10, 11, 12, 13, 14]
}
df = pd.DataFrame(data)

# Missing Data
df.fillna(0, inplace=True)  # Fill missing values with 0

In [18]:
df

Unnamed: 0,A,B,C
0,1.0,6.0,10
1,2.0,7.0,11
2,0.0,8.0,12
3,4.0,9.0,13
4,5.0,0.0,14


### Noisy Data

In [19]:
# Noisy Data
df['A'] = df['A'] + np.random.normal(0, 0.1, len(df))  # Add noise to column A

In [20]:
df

Unnamed: 0,A,B,C
0,1.015811,6.0,10
1,2.016815,7.0,11
2,0.014207,8.0,12
3,4.053917,9.0,13
4,4.89107,0.0,14


### Outliers Detection

In [21]:
# Outliers Detection
def detect_outliers(data, threshold=3):
    z_scores = (data - data.mean()) / data.std()
    outliers = np.abs(z_scores) > threshold
    return outliers

In [22]:
outliers = detect_outliers(df['A'])
# Replace outliers with NaN; .loc avoids the SettingWithCopyWarning raised by chained indexing (df['A'][outliers] = ...)
df.loc[outliers, 'A'] = np.nan
# With only five rows no |z-score| can exceed 3, so nothing is flagged here


In [23]:
df

Unnamed: 0,A,B,C
0,1.015811,6.0,10
1,2.016815,7.0,11
2,0.014207,8.0,12
3,4.053917,9.0,13
4,4.89107,0.0,14
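
The output above is unchanged because, with the sample standard deviation, the largest possible |z-score| among n points is (n - 1) / sqrt(n), which is about 1.79 for n = 5. The sketch below reuses the same function on a larger, hypothetical sample where the threshold of 3 can actually be reached.

```python
import numpy as np
import pandas as pd

def detect_outliers(data, threshold=3):
    z_scores = (data - data.mean()) / data.std()
    return np.abs(z_scores) > threshold

# Hypothetical sample: twenty ordinary readings plus one extreme value
readings = pd.Series(list(range(1, 21)) + [200], dtype="float64")

mask = detect_outliers(readings)

# where() keeps values where the condition is True and inserts NaN elsewhere
cleaned = readings.where(~mask)

print(readings[mask].tolist())
```

On very small samples, a rule based on quantiles or domain limits may be more practical than a z-score threshold.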


### Join / Concatenate

In [24]:
# Join
df2 = pd.DataFrame({'D': [15, 16, 17, 18, 19]})
df = df.join(df2)

In [25]:
df

Unnamed: 0,A,B,C,D
0,1.015811,6.0,10,15
1,2.016815,7.0,11,16
2,0.014207,8.0,12,17
3,4.053917,9.0,13,18
4,4.89107,0.0,14,19


### Melt

In [26]:
# Melt
df_melted = pd.melt(df, id_vars=['D'], value_vars=['A', 'B', 'C'], var_name='Variable', value_name='Value')

In [27]:
df_melted

Unnamed: 0,D,Variable,Value
0,15,A,1.015811
1,16,A,2.016815
2,17,A,0.014207
3,18,A,4.053917
4,19,A,4.89107
5,15,B,6.0
6,16,B,7.0
7,17,B,8.0
8,18,B,9.0
9,19,B,0.0
10,15,C,10.0
11,16,C,11.0
12,17,C,12.0
13,18,C,13.0
14,19,C,14.0


### Cut

In [28]:
# Cut
df['Bins'] = pd.cut(df['A'], bins=[0, 2, 5, np.inf], labels=['Low', 'Medium', 'High'])
df

Unnamed: 0,A,B,C,D,Bins
0,1.015811,6.0,10,15,Low
1,2.016815,7.0,11,16,Medium
2,0.014207,8.0,12,17,Low
3,4.053917,9.0,13,18,Medium
4,4.89107,0.0,14,19,Medium


### Clean

In [31]:
# Clean
df.dropna(inplace=True)  # Remove rows with missing values

In [32]:
df

Unnamed: 0,A,B,C,D,Bins,A_squared
0,1.015811,6.0,10,15,Low,1.031872
1,2.016815,7.0,11,16,Medium,4.067541
2,0.014207,8.0,12,17,Low,0.000202
3,4.053917,9.0,13,18,Medium,16.434243
4,4.89107,0.0,14,19,Medium,23.92257


### Slicing

In [33]:
# Slicing
df_sliced = df.iloc[1:3, 1:4]  # Select rows 1 and 2, columns 1, 2, and 3
df_sliced

Unnamed: 0,B,C,D
1,7.0,11,16
2,8.0,12,17


### Reshaping

In [34]:
# Reshaping
df_pivoted = df.pivot(index='D', columns='Bins', values='A')
df_pivoted

D,Low,Medium
15,1.015811,
16,,2.016815
17,0.014207,
18,,4.053917
19,,4.89107


### Filter

In [36]:
# Filter
df_filtered = df[df['A'] > 2]  # Filter rows where column A > 2
df_filtered

Unnamed: 0,A,B,C,D,Bins,A_squared
1,2.016815,7.0,11,16,Medium,4.067541
3,4.053917,9.0,13,18,Medium,16.434243
4,4.89107,0.0,14,19,Medium,23.92257


### Group by

In [37]:
# Group by
df_grouped = df.groupby('Bins').mean()  # Calculate the mean for each group
df_grouped

Bins,A,B,C,D,A_squared
Low,0.515009,7.0,11.0,16.0,0.516037
Medium,3.653934,5.333333,12.666667,17.666667,14.808118
High,,,,,


### Label Encoding

In [38]:
# Label Encoding
df['Bins_encoded'] = df['Bins'].astype('category').cat.codes

In [39]:
df

Unnamed: 0,A,B,C,D,Bins,A_squared,Bins_encoded
0,1.015811,6.0,10,15,Low,1.031872,0
1,2.016815,7.0,11,16,Medium,4.067541,1
2,0.014207,8.0,12,17,Low,0.000202,0
3,4.053917,9.0,13,18,Medium,16.434243,1
4,4.89107,0.0,14,19,Medium,23.92257,1


### Pivot and Merge

In [40]:
# Pivot and Merge
df_pivot = df.pivot(index='D', columns='Bins', values='A')
df_merge = pd.merge(df, df_pivot, on='D')
df_merge

Unnamed: 0,A,B,C,D,Bins,A_squared,Bins_encoded,Low,Medium
0,1.015811,6.0,10,15,Low,1.031872,0,1.015811,
1,2.016815,7.0,11,16,Medium,4.067541,1,,2.016815
2,0.014207,8.0,12,17,Low,0.000202,0,0.014207,
3,4.053917,9.0,13,18,Medium,16.434243,1,,4.053917
4,4.89107,0.0,14,19,Medium,23.92257,1,,4.89107


### Concatenate

In [42]:
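# Concatenate
# concat aligns on the index: df is indexed 0-4 while df_pivot is indexed by D (15-19),
# so no rows line up and the result is padded with NaN
df_concat = pd.concat([df, df_pivot], axis=1)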
df_concat

Unnamed: 0,A,B,C,D,Bins,A_squared,Bins_encoded,Low,Medium
0,1.015811,6.0,10.0,15.0,Low,1.031872,0.0,,
1,2.016815,7.0,11.0,16.0,Medium,4.067541,1.0,,
2,0.014207,8.0,12.0,17.0,Low,0.000202,0.0,,
3,4.053917,9.0,13.0,18.0,Medium,16.434243,1.0,,
4,4.89107,0.0,14.0,19.0,Medium,23.92257,1.0,,
15,,,,,,,,1.015811,
16,,,,,,,,,2.016815
17,,,,,,,,0.014207,
18,,,,,,,,,4.053917
19,,,,,,,,,4.89107


### MultiIndexing

In [43]:
# MultiIndexing
df_multiindexed = df.set_index(['D', 'Bins'])
df_multiindexed

D,Bins,A,B,C,A_squared,Bins_encoded
15,Low,1.015811,6.0,10,1.031872,0
16,Medium,2.016815,7.0,11,4.067541,1
17,Low,0.014207,8.0,12,0.000202,0
18,Medium,4.053917,9.0,13,16.434243,1
19,Medium,4.89107,0.0,14,23.92257,1


### Stacking

In [44]:
# Stacking
df_stacked = df_multiindexed.stack()
df_stacked

D   Bins                
15  Low     A                1.015811
            B                6.000000
            C               10.000000
            A_squared        1.031872
            Bins_encoded     0.000000
16  Medium  A                2.016815
            B                7.000000
            C               11.000000
            A_squared        4.067541
            Bins_encoded     1.000000
17  Low     A                0.014207
            B                8.000000
            C               12.000000
            A_squared        0.000202
            Bins_encoded     0.000000
18  Medium  A                4.053917
            B                9.000000
            C               13.000000
            A_squared       16.434243
            Bins_encoded     1.000000
19  Medium  A                4.891070
            B                0.000000
            C               14.000000
            A_squared       23.922570
            Bins_encoded     1.000000
dtype: float64

### Hierarchical indexing

In [45]:
# Hierarchical indexing
df_hierarchical = df.set_index(['D', 'Bins'])
df_hierarchical

D,Bins,A,B,C,A_squared,Bins_encoded
15,Low,1.015811,6.0,10,1.031872,0
16,Medium,2.016815,7.0,11,4.067541,1
17,Low,0.014207,8.0,12,0.000202,0
18,Medium,4.053917,9.0,13,16.434243,1
19,Medium,4.89107,0.0,14,23.92257,1


### Aggregate

In [46]:
# Aggregate
df_aggregated = df.groupby('Bins').agg({'A': 'sum', 'C': 'mean'})
df_aggregated

Bins,A,C
Low,1.030018,11.0
Medium,10.961802,12.666667
High,0.0,


### Summarize data

In [47]:
# Summarize data
df_summary = df.describe()
df_summary

Unnamed: 0,A,B,C,D,A_squared,Bins_encoded
count,5.0,5.0,5.0,5.0,5.0,5.0
mean,2.398364,6.0,12.0,17.0,9.091286,0.6
std,2.043017,3.535534,1.581139,1.581139,10.567947,0.547723
min,0.014207,0.0,10.0,15.0,0.000202,0.0
25%,1.015811,6.0,11.0,16.0,1.031872,0.0
50%,2.016815,7.0,12.0,17.0,4.067541,1.0
75%,4.053917,8.0,13.0,18.0,16.434243,1.0
max,4.89107,9.0,14.0,19.0,23.92257,1.0


# 1. Common Data Problems

- Data Type Constraints
    - A column may be loaded with one data type by default and need converting to another data type for ease of calculation and analytics.
    - Example:
        - a numeric value like revenue, which should be an integer, is stored as a string
        - a categorical variable is represented as a number and is mistakenly treated as an int variable 
- Data Range Constraints
    - Data that should fall within a range.
    - Example:
        - a rating which can only hold values between 1–5 or 1–10.
        - a subscription date column which cannot contain future dates.    
- Uniqueness Constraints
    - Duplicate values occur when the exact same information is repeated across multiple rows, for some or all columns in our DataFrame.
    - Causes:
        - Data Entry Error
        - Join or Merge Errors
        - Bugs and Design Errors

In [None]:
################################################### Data type conversion
############################# Example 1

# Get data types of columns
sales.dtypes

In [None]:
type(sales['Revenue'])

In [None]:
sales['Revenue'] = sales['Revenue'].str.strip('$')
sales['Revenue'] = sales['Revenue'].astype('int')
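
astype('int') raises as soon as one value cannot be parsed. A more forgiving variant, sketched here on a hypothetical frame `sales_demo`, strips the formatting characters and lets pd.to_numeric mark the failures as NaN.

```python
import pandas as pd

# Hypothetical sales data with currency symbols, thousands separators and a bad entry
sales_demo = pd.DataFrame({"Revenue": ["$1,200", "$850", "n/a", "$43"]})

# Remove "$" and "," (inside a character class both are literal characters)
stripped = sales_demo["Revenue"].str.replace(r"[$,]", "", regex=True)

# errors="coerce" converts unparseable entries such as "n/a" to NaN instead of raising
sales_demo["Revenue"] = pd.to_numeric(stripped, errors="coerce")

print(sales_demo["Revenue"].dtype)
```

The column ends up as float because NaN cannot be stored in a plain integer column.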

In [None]:
############################## Example 2

# marriage_status values as numbers: 0 = Never married ,1 = Married ,2 = Separated ,3 = Divorced

df['marriage_status'].describe()

In [None]:
df["marriage_status"] = df["marriage_status"].astype('category')
df.describe()

In [None]:
##################################################### Data Range Constraints
############################# Example 1

#  movie rating data
import matplotlib.pyplot as plt
plt.hist(movies['avg_rating'])
plt.title('Average rating of movies (1-5)')

Many ways exist to deal with these out-of-range values, 
- like dropping the rows, 
- setting them to the maximum value 5, or 
- setting them to an average rating value 

In [None]:
import pandas as pd
# Output Movies with rating > 5
movies[movies['avg_rating'] > 5]

In [None]:
# Two equivalent ways to drop out-of-range rows (use one of them)
# Drop values using filtering
movies = movies[movies['avg_rating'] <= 5]

# Drop values using .drop()
movies.drop(movies[movies['avg_rating'] > 5].index, inplace = True)

In [None]:
# Convert avg_rating > 5 to 5
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5
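
When both ends of the range matter, clip() caps values in one call. This is a minimal sketch on a hypothetical frame `ratings`, assuming the valid range is 1 to 5.

```python
import pandas as pd

# Hypothetical ratings with one value above and one below the valid range
ratings = pd.DataFrame({"title": ["A", "B", "C"], "avg_rating": [4.5, 6.0, 0.2]})

# clip() caps values at both ends of the valid range
ratings["avg_rating"] = ratings["avg_rating"].clip(lower=1, upper=5)

# Verify the constraint now holds for every row
assert ratings["avg_rating"].between(1, 5).all()
print(ratings["avg_rating"].tolist())
```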

In [None]:
############################## Example 2

# see the date range falling out of expected values:

import datetime as dt
import pandas as pd
# Output data types
user_signups.dtypes

In [None]:
# Convert to date
user_signups['subscription_date'] = pd.to_datetime(user_signups['subscription_date']).dt.date

In [None]:
# filter out wrong/future date values in the given dataframe
import datetime as dt
today_date = dt.date.today()
user_signups[user_signups['subscription_date'] > today_date]
# this code was originally run when today's date was 1/1/2020; the result was filtered accordingly

In [None]:
# Three alternative ways to handle future dates (use one of them)

today_date = dt.date.today()
# 1. Drop values using filtering
user_signups = user_signups[user_signups['subscription_date'] <= today_date]
# 2. Drop values using .drop()
user_signups.drop(user_signups[user_signups['subscription_date'] > today_date].index, inplace = True)
# 3. Replace future dates with today's date using .loc
user_signups.loc[user_signups['subscription_date'] > today_date, 'subscription_date'] = today_date
# Assert that no future dates remain (the column holds datetime.date objects, so max() is already a date)
assert user_signups.subscription_date.max() <= today_date

In [None]:
################################################# Uniqueness: How to find duplicate Values
########################### Example 1

height_weight.head()

In [None]:
# finding duplicates in a DataFrame by using the duplicated() method.
# returns a Series of boolean values that are True for duplicate values, and False for non-duplicated values.

duplicates = height_weight.duplicated()
print(duplicates)

In [None]:
# By default all columns must match for a row to count as a duplicate, and every duplicate is marked True except for its first occurrence.
# To control how duplicates are found, use 2 arguments of the duplicated() method.
### The subset argument takes a list of column names to check for duplication.
### The keep argument decides which occurrence is left unmarked: 'first' keeps the first occurrence, 'last' keeps the last occurrence, and False marks all occurrences as duplicates.

In [None]:
column_names = ['first_name','last_name','address']
duplicates = height_weight.duplicated(subset = column_names, keep = False)

In [None]:
height_weight[duplicates].sort_values(by = 'first_name')

In [None]:
# Treating Duplicates
# drop_duplicates() method: takes the same subset and keep arguments as duplicated(), plus inplace

In [None]:
height_weight.drop_duplicates(inplace = True)
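
The effect of subset and keep is easiest to see on a tiny example. The frame `people` below is hypothetical; two rows share a name but differ slightly in height.

```python
import pandas as pd

people = pd.DataFrame({
    "first_name": ["Ann", "Ann", "Bob", "Cid"],
    "last_name": ["Lee", "Lee", "Ray", "Fox"],
    "height": [170, 171, 180, 165],
})

key_cols = ["first_name", "last_name"]

# keep=False flags every member of a duplicate group, which is useful for inspection
flagged = people[people.duplicated(subset=key_cols, keep=False)]

# keep="first" retains one row per group when dropping
deduped = people.drop_duplicates(subset=key_cols, keep="first")

print(len(flagged), len(deduped))
```

Without subset, neither "Ann Lee" row would be flagged, because the heights differ.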

# 2. Text and Categorical Data Problems

A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values. Categorical data therefore represent a predefined, finite set of categories and cannot take values outside that set.
- Examples:
    - marriage status, 
    - household income categories, 
    - loan status 
    
Categorical data are often encoded as numbers before machine learning models are run on them.

- Membership constraints
    - inconsistencies in our categorical data
    - reasons:
        - due to data entry issues with free text vs dropdown fields, 
        - data parsing errors and other types of errors.
- Categorical variables
    - Examples:
        - value inconsistency, 
        - the presence of too many categories that could be collapsed into one, and 
        - making sure data is of the right type.
- Cleaning text data
    - Examples of text data problems include:
        - handling inconsistencies, 
        - making sure text data is of a certain length, 
        - typos

In [None]:
############################################## Dealing with Inconsistencies
####################### Example 1

# Check the predefined set of categories
# The categories dataframe will help us systematically spot all rows with these inconsistencies.

print(categories)

##### Use joins to fix categorical inconsistencies in data. 2 main types: anti joins and inner joins.
Joins combine DataFrames on columns they have in common. 
- Anti joins take in two DataFrames A and B and return the data from one DataFrame that is not contained in the other.
- Example: 
    - a left anti join of A and B returns the rows of A whose value in the common column is not found in B.
- Inner joins return only the data that is contained in both DataFrames. 
- Example:
    - an inner join of A and B returns columns from both DataFrames for the rows whose value in the common column is found in both A and B.

In [None]:
# the set difference acts like a left anti join: it returns the blood types in study_data that are not in categories

inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)

In [None]:
# find the row which has this inconsistent value :

# Get and print rows with inconsistent categories
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
study_data[inconsistent_rows]

Drop the inconsistent rows and keep only the consistent ones. The tilde symbol (~) negates the boolean mask, so subsetting with it returns everything except the inconsistent rows.

In [None]:
inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
inconsistent_data = study_data[inconsistent_rows]
# Drop inconsistent categories and get consistent data only
consistent_data = study_data[~inconsistent_rows]
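
The set-based approach above can also be written as actual joins with merge(). This is a sketch on hypothetical stand-ins `study` and `valid`; it relies on the `indicator=True` option of merge, which adds a `_merge` column.

```python
import pandas as pd

# Hypothetical stand-ins for study_data and the reference categories table
study = pd.DataFrame({"name": ["Bea", "Ian", "Joy"], "blood_type": ["A+", "Z+", "O-"]})
valid = pd.DataFrame({"blood_type": ["A+", "A-", "B+", "B-", "O+", "O-", "AB+", "AB-"]})

# indicator=True adds a "_merge" column saying where each row was found
merged = study.merge(valid, on="blood_type", how="left", indicator=True)

# Left anti join: rows of study with no match in valid
anti = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Inner join: rows of study whose blood type is valid
inner = study.merge(valid, on="blood_type", how="inner")

print(anti["blood_type"].tolist(), sorted(inner["blood_type"]))
```

The isin() version is shorter for a single column; the merge version extends naturally to joins on several columns.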

In [None]:
############################################# Value inconsistency and collapsing into a single category
######################### Example 1

# Get marriage status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts()

In [None]:
# deal with this by uppercasing or lowercasing the marriage_status column, 
# using the .str.upper() or .str.lower() methods

In [None]:
# Uppercase
demographics['marriage_status'] = demographics['marriage_status'].str.upper()
demographics['marriage_status'].value_counts()

In [None]:
# Lowercase
demographics['marriage_status'] = demographics['marriage_status'].str.lower()
demographics['marriage_status'].value_counts()

In [None]:
######################### Example 2
# Leading or Trailing spaces: ‘married ‘ , ‘married’ , ‘unmarried’ , ‘ unmarried’ 

In [None]:
# Get marriage status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts()

In [None]:
# to remove leading and trailing spaces, use the .str.strip() method, which when given no input 
# strips all leading and trailing white spaces.

In [None]:
# Strip leading and trailing spaces (assign back to the column, not to the whole dataframe)
demographics['marriage_status'] = demographics['marriage_status'].str.strip()
demographics['marriage_status'].value_counts()

In [None]:
############################################# Collapse values in a column into categories or bins.
# When the number of categories is very high, we may need to collapse them into fewer groups
# by mapping several categories to one broader category.
############################# Example 1

# Collapse a numeric column into bins
# using pd.cut(), which lets us define category cutoff ranges with the bins argument.
# It takes a list of cutoff points for the categories; the final one can be infinity, written as np.inf (a constant, not a function).

# Using cut() - create category ranges and names
ranges = [0,200000,500000,np.inf]
group_names = ['0-200K', '200K-500K', '500K+']
# Create income group column
demographics['income_group'] = pd.cut(demographics['household_income'], bins=ranges, labels=group_names)
demographics[['income_group', 'household_income']]
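
One detail of `pd.cut()` worth checking is how the bin edges are treated. By default each bin is closed on the right and open on the left, so a value equal to the lowest edge (here 0) falls outside every bin and becomes NaN unless `include_lowest=True` is passed. The sketch below uses invented income values to show where the boundary values land.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes chosen to sit on and around the bin edges
demographics = pd.DataFrame({
    "household_income": [0, 150000, 200000, 350000, 750000],
})

ranges = [0, 200000, 500000, np.inf]
group_names = ["0-200K", "200K-500K", "500K+"]

# include_lowest=True keeps an income of exactly 0 in the first bin
demographics["income_group"] = pd.cut(
    demographics["household_income"],
    bins=ranges,
    labels=group_names,
    include_lowest=True,
)
print(demographics)
```

With right-closed bins, an income of exactly 200000 is labelled `0-200K` rather than `200K-500K`; pass `right=False` if the opposite convention is wanted.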

In [None]:
########################## Example 2

# Mapping categories to fewer ones:
# operating_system column is: 'Microsoft', 'MacOS', 'IOS', 'Android', 'Linux'
# operating_system column should become: 'DesktopOS', 'MobileOS'

In [None]:
# Create mapping dictionary and replace
mapping = {'Microsoft':'DesktopOS', 'MacOS':'DesktopOS', 'Linux':'DesktopOS',
'IOS':'MobileOS', 'Android':'MobileOS'}
devices['operating_system'] = devices['operating_system'].replace(mapping)
devices['operating_system'].unique() # to check result
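
A mapping dictionary only covers the values it lists, so it can help to check for unmapped values first. In this sketch the `devices` rows are invented, and `'ChromeOS'` is a hypothetical value deliberately left out of the mapping to show the difference between `.replace()` and `.map()`.

```python
import pandas as pd

# Hypothetical data; 'ChromeOS' is intentionally missing from the mapping
devices = pd.DataFrame({
    "operating_system": ["Microsoft", "MacOS", "IOS", "Android", "Linux", "ChromeOS"],
})

mapping = {"Microsoft": "DesktopOS", "MacOS": "DesktopOS", "Linux": "DesktopOS",
           "IOS": "MobileOS", "Android": "MobileOS"}

# Values in the column that the mapping does not cover
unmapped = set(devices["operating_system"]) - set(mapping)

# .replace() leaves unmapped values unchanged
replaced = devices["operating_system"].replace(mapping)

# .map() turns unmapped values into NaN instead
mapped = devices["operating_system"].map(mapping)

print(unmapped)
print(replaced.unique())
print(mapped.isna().sum())
```

Which behavior is preferable depends on the goal: `.replace()` keeps unexpected values visible, while `.map()` flags them as missing.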

In [None]:
###################################################### Cleaning text Data
####################### Example 1

# Use cases
# feed these phone numbers into an automated call system
# create a report discussing the distribution of users by area code

# What we expect:
### 1. Phone numbers all begin with 00 (a leading "+" is replaced by "00")
### 2. Any number with fewer than 10 digits is replaced with NaN to represent a missing value, and
### 3. All the dashes have been removed

In [None]:
# Replace "+" with "00" (regex=False treats "+" as a literal character, not a regex quantifier)
phones["Phone number"] = phones["Phone number"].str.replace("+", "00", regex=False)
phones

In [None]:
# Replace "-" with nothing (literal replacement)
phones["Phone number"] = phones["Phone number"].str.replace("-", "", regex=False)
phones

In [None]:
# Replace phone numbers that have fewer than 10 digits with NaN
digits = phones['Phone number'].str.len()
phones.loc[digits < 10, "Phone number"] = np.nan
phones
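
The three phone-number steps can be run together on a small sample to confirm the expected result. This is a sketch under assumptions: the numbers below are made up, and the column is assumed to contain only digits, "+" and "-".

```python
import numpy as np
import pandas as pd

# Hypothetical phone numbers: one with "+", one already prefixed, one too short
phones = pd.DataFrame({
    "Phone number": ["+1-202-555-0143", "001-202-555-0199", "555-0100"],
})

# 1. Replace the leading "+" with "00" (literal match, not a regex)
phones["Phone number"] = phones["Phone number"].str.replace("+", "00", regex=False)

# 2. Remove all dashes
phones["Phone number"] = phones["Phone number"].str.replace("-", "", regex=False)

# 3. Mark numbers with fewer than 10 characters as missing
digits = phones["Phone number"].str.len()
phones.loc[digits < 10, "Phone number"] = np.nan

print(phones)
```

The length check runs last on purpose: counting characters before the dashes are removed would let short numbers pass the 10-character threshold.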

In [None]:
###################################################### Regular Expressions
####################### Example 2

# The column contains a range of symbols: plus signs, dashes, and parentheses.
# Regular expressions give us the ability to search for any pattern in text data.

In [None]:
# Remove every non-digit character (\D matches anything that is not 0-9)
phones['Phone number'] = phones['Phone number'].str.replace(r'\D+', '', regex=True)
phones.head()
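
The regular-expression approach can be checked on numbers that mix several symbols. The sample values below are invented for illustration. Note that `regex=True` is passed explicitly, because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical numbers containing parentheses, spaces, dots, dashes and a plus sign
phones = pd.DataFrame({
    "Phone number": ["(202) 555-0143", "+1 202.555.0199", "202-555-0170"],
})

# \D+ matches one or more consecutive non-digit characters
phones["Phone number"] = phones["Phone number"].str.replace(r"\D+", "", regex=True)

print(phones)
```

Because `\D` also removes "+", any step that converts "+" to "00" has to run before this one if the international prefix should be kept.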

# 3. Advanced Data Problems
- Uniformity
- Cross field validation
- Completeness

# 4. Record Linkage
- String comparison
- Generating pairs
- Linking Dataframes