# Creating The Artificial Dataset

In [164]:
import random
import pandas as pd
import numpy as np
import warnings 
warnings.filterwarnings('ignore')
from datetime import datetime, timedelta

# Add The Columns
names = ['Jai', 'Anuj', 'Princi', 'Gaurav', 'Abhi', 'Ravi', 'Amit',
         'Rahul', 'Kamal', 'Vikram', 'Sachin', 'Ankit', 'Mukesh', 'Saurabh', 
         'Rajesh', 'Suresh', 'Praveen', 'Vijay', 'Sandeep', 'Sunil', 'Deepak',
         'Vinod', 'Manish', 'Tarun',None,None]
addresses = ['Nagpur', 'Kanpur', 'Allahabad', 'Kannauj', 'Jaunpur', 'Aligarh',
             'Lucknow', 'Bhopal',None,None]
qualifications = ['Msc', 'MA', 'MCA', 'Phd', 'B.Tech', 'B.com', 'B.A',
                  'Diploma',None,None]
data1 = {'Name': [], 'Age': [], 'Address': [], 'Qualification': [], 'Score': []}

for i in range(2000):
    data1['Name'].append(random.choice(names))
    data1['Age'].append(random.randint(22, 60))
    data1['Address'].append(random.choice(addresses))
    data1['Qualification'].append(random.choice(qualifications))
    data1['Score'].append(random.randint(20, 100))

    
       
df = pd.DataFrame(data1)
   
# Adding A Date column
start_date = datetime(2020, 1, 1)
end_date = datetime(2023, 2, 4)
time_between_dates = end_date - start_date
days_between_dates = time_between_dates.days
df['Date'] = [start_date + timedelta(days=random.randint(0, days_between_dates)) for i in range(df.shape[0])]

# Changing the Data Type
df['Date'] = df['Date'].astype(str)
df['Age'] = df['Age'].astype(str)
df['Score'] = df['Score'].astype(str)

# Add The Duplicated Data
duplicated_rows = df.loc[random.sample(list(range(2000)), 500)].copy()
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
df = pd.concat([df, duplicated_rows], ignore_index=True)
rows = list(range(df.shape[0]))


# Add The Null Data
rows = random.sample(rows, int(df.shape[0] * 0.05))

columns = random.sample(list(df.columns), int(df.shape[1] * 0.2))

for column in columns:
    df.loc[rows, column] = np.nan
    
    
    
# Shuffle The Data
df = df.sample(frac=1, random_state=0).reset_index(drop=True)


# Final Data
print(df)

         Name Age    Address Qualification Score        Date
0         Jai  44     Nagpur           MCA    31  2022-05-11
1        None  51        NaN         B.com    72  2020-10-21
2      Princi  48    Lucknow           Msc    88  2022-12-27
3      Gaurav  58    Lucknow           Msc    77  2021-06-05
4      Mukesh  26  Allahabad       Diploma    81  2021-09-05
...       ...  ..        ...           ...   ...         ...
2495  Praveen  35    Kannauj           B.A    90  2023-01-16
2496   Suresh  29  Allahabad          None    58  2022-10-25
2497     None  45     Nagpur           Msc    22  2021-01-26
2498   Mukesh  56       None       Diploma    32  2022-01-25
2499    Vijay  38    Jaunpur           Msc    84  2022-08-02

[2500 rows x 6 columns]


We generated this dataset with deliberately injected duplicate rows and missing values, and with the numbers and dates stored as strings, to demonstrate why data cleaning matters and to improve our understanding of the process.

# Data Exploration and Data Cleaning

Data Exploration is the process of analyzing and summarizing the characteristics and patterns of a dataset in order to gain a better understanding of the data and identify any potential issues that may need to be addressed before using the data for further analysis or modeling.

Data Cleaning is the process of correcting or removing errors, inconsistencies, and inaccuracies in a dataset in order to improve the quality and reliability of the data for analysis and modeling.

### Checking Anomalies in Data Statistics and Types

The **describe method** can be useful for quickly getting an overview of the distribution of your data and detecting any outliers or anomalies. It can also help identify any missing or non-numeric values in your data.

In [165]:
df.describe(include='all')

          Name   Age  Address  Qualification  Score        Date
count     2315  2500     1900           2053   2500        2500
unique      24    39        8              8     81         933
top     Suresh    54   Nagpur          B.com     74  2020-03-09
freq       113    87      274            280     46          11
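
The summary only shows count, unique, top and freq because every column is still stored as text. A minimal sketch on a toy frame (the values and the `toy` name are made up for illustration) shows how the same call changes once a column is numeric:

```python
import pandas as pd

# Toy frame with hypothetical values; Age is stored as strings, as in the dataset above
toy = pd.DataFrame({
    "Age": ["25", "31", "31", "40"],
    "City": ["Kanpur", None, "Kanpur", "Bhopal"],
})

# With text-only columns, describe reports count / unique / top / freq
as_text = toy.describe(include="all")

# After converting Age to integers, describe also reports mean, std, min, quartiles and max
toy["Age"] = toy["Age"].astype(int)
as_number = toy.describe(include="all")

print(as_text)
print(as_number)
```

Note that `count` excludes missing entries, so a count below the number of rows is a quick sign that a column contains nulls.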


The info method in pandas is used to obtain a concise summary of a dataframe, including the number of non-missing values in each column, the data type of each column, and the memory usage of the dataframe.

In [166]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Name           2315 non-null   object
 1   Age            2500 non-null   object
 2   Address        1900 non-null   object
 3   Qualification  2053 non-null   object
 4   Score          2500 non-null   object
 5   Date           2500 non-null   object
dtypes: object(6)
memory usage: 117.3+ KB


Every column has dtype `object`, which means the numeric columns (`Age`, `Score`) and the `Date` column are stored as strings. Before analyzing or manipulating them, we need to convert these columns to numeric and datetime data types.

### Data Type Conversion

In [167]:
df['Score'] = df['Score'].astype(int)
df['Age'] = df['Age'].astype(float)   # float rather than int: a float column can hold NaN
df['Date'] = pd.to_datetime(df['Date'])

In [168]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Name           2315 non-null   object        
 1   Age            2500 non-null   float64       
 2   Address        1900 non-null   object        
 3   Qualification  2053 non-null   object        
 4   Score          2500 non-null   int32         
 5   Date           2500 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int32(1), object(3)
memory usage: 107.5+ KB
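
The conversion above works because every string in these columns parses cleanly. As a hedged sketch of what could be done when that is not guaranteed, `pd.to_numeric` and `pd.to_datetime` accept `errors="coerce"`, which turns unparseable entries into missing values instead of raising. The sample values below are hypothetical:

```python
import pandas as pd

# Hypothetical raw values; "n/a" and "not a date" cannot be parsed
raw_scores = pd.Series(["42", "17", "n/a", "58"])
raw_dates = pd.Series(["2021-06-05", "not a date"])

# astype(int) would raise ValueError on "n/a"; coerce replaces it with NaN
scores = pd.to_numeric(raw_scores, errors="coerce")

# Unparseable dates become NaT (the datetime equivalent of NaN)
dates = pd.to_datetime(raw_dates, errors="coerce")

print(scores)
print(dates)
```

The coerced entries then show up in `isnull().sum()` and can be handled together with the other missing values.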


### Checking For Duplicated Rows

The duplicated method in pandas is used to check for duplicate rows in a dataframe. The method returns a boolean series with True values for the rows that are duplicates and False values for the unique rows.

The value_counts method can then be applied to the resulting series to get a count of the unique values, i.e., the number of duplicate and unique rows in the dataframe.

In [169]:
df.duplicated().value_counts()

False    2032
True      468
dtype: int64
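
By default `duplicated` leaves the first occurrence of each repeated row unflagged, which is why the count of `True` is the number of extra copies rather than the number of rows involved. A small sketch on made-up data shows the difference the `keep` parameter makes:

```python
import pandas as pd

# Hypothetical rows: the "Jai" row appears three times
toy = pd.DataFrame({"Name": ["Jai", "Anuj", "Jai", "Jai"],
                    "Score": [31, 72, 31, 31]})

# Default keep="first": only the second and third copies are flagged
later_copies = toy.duplicated()

# keep=False: every row that has at least one twin is flagged
all_copies = toy.duplicated(keep=False)

print(later_copies.sum(), all_copies.sum())
```

Using `keep=False` is handy for inspecting all copies side by side before deciding what to drop.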

Whether to keep or drop duplicates depends on the goal of the analysis: repeated rows can be genuine observations worth keeping, or errors that would distort the results.
In this case we drop the duplicated rows.

In [170]:
df.drop_duplicates(inplace=True)
df.duplicated().value_counts()

False    2032
dtype: int64
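
One detail worth knowing: `drop_duplicates` keeps the original index labels of the surviving rows, so the index has gaps afterwards. The sketch below, on made-up rows, shows this and one way to renumber the index if a contiguous index is wanted:

```python
import pandas as pd

# Hypothetical rows: row 1 is a copy of row 0
toy = pd.DataFrame({"Name": ["Jai", "Jai", "Anuj"], "Score": [31, 31, 72]})

# Surviving rows keep their old labels, leaving a gap where the copy was
deduped = toy.drop_duplicates()

# ignore_index=True renumbers the result 0..n-1
deduped_renumbered = toy.drop_duplicates(ignore_index=True)

print(deduped.index.tolist(), deduped_renumbered.index.tolist())
```

Gaps are harmless for most operations, but they matter for any later step that reasons about index labels, such as measuring how far apart two rows are.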

### Checking for Null Values

The **isnull** method in pandas is used to check for missing values in a dataframe. The method returns a dataframe of the same shape as the original dataframe, with True values in the cells where the corresponding value in the original dataframe is missing (i.e., NaN or None), and False values in the cells where the corresponding value is not missing.

The **sum** method can then be applied to the resulting dataframe to calculate the number of missing values in each column.

To have numeric columns to clean, we first inject missing values into `Age` and `Score`, using the same 250 randomly chosen rows for both columns.

In [171]:
rows = np.random.choice(df.index, size=250, replace=False)
df.loc[rows, 'Age'] = np.nan
df.loc[rows, 'Score'] = np.nan

In [172]:
df.isnull().sum()

Name             148
Age              250
Address          501
Qualification    365
Score            250
Date               0
dtype: int64
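
Raw counts are easier to judge next to the share of rows they represent. A minimal sketch on a toy frame with made-up values: since `isnull()` returns booleans, taking the mean gives the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 2 of 4 ages and 1 of 4 addresses are missing
toy = pd.DataFrame({"Age": [44, np.nan, 48, np.nan],
                    "Address": ["Nagpur", None, "Lucknow", "Bhopal"]})

missing_count = toy.isnull().sum()
# Mean of booleans = fraction of True values; multiply by 100 for a percentage
missing_percent = toy.isnull().mean() * 100

print(pd.DataFrame({"missing": missing_count, "percent": missing_percent}))
```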

### Handling missing values

1. For the categorical data

In [173]:
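# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about
df['Name'] = df['Name'].fillna('Unknown')
df['Address'] = df['Address'].fillna('Unknown')
df['Qualification'] = df['Qualification'].fillna('Unknown')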
df.isnull().sum()

Name               0
Age              250
Address            0
Qualification      0
Score            250
Date               0
dtype: int64
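
The same fill can also be written as a single call by passing a dictionary that maps column names to fill values. This is only a sketch on made-up data; columns that are not listed in the dictionary are left untouched:

```python
import pandas as pd

# Hypothetical frame with gaps in two text columns and one numeric column
toy = pd.DataFrame({"Name": ["Jai", None],
                    "Address": [None, "Kanpur"],
                    "Age": [44.0, None]})

# Dict keys are column names, values are the replacement for missing entries
filled = toy.fillna({"Name": "Unknown", "Address": "Unknown"})

print(filled)
```

Leaving `Age` out of the dictionary keeps its missing value in place, so it can be handled separately with a numeric strategy.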

2. For the numerical data

(a) Age

In [174]:
df.loc[df['Age'].isnull(), 'Qualification'].value_counts()

Unknown    44
Msc        32
B.com      31
MA         28
Phd        28
B.Tech     25
B.A        23
Diploma    20
MCA        19
Name: Qualification, dtype: int64

In [175]:
df.loc[df['Age'].isnull(), 'Name'].value_counts()

Unknown    21
Praveen    17
Amit       14
Vinod      12
Saurabh    12
Vikram     11
Kamal      11
Manish     11
Rahul      11
Sandeep    11
Sunil      10
Suresh     10
Gaurav      9
Abhi        9
Anuj        9
Sachin      9
Jai         9
Mukesh      8
Vijay       8
Ravi        7
Tarun       7
Ankit       7
Rajesh      6
Deepak      6
Princi      5
Name: Name, dtype: int64

In [176]:
df.loc[df['Age'].isnull(), 'Address'].value_counts()

Unknown      65
Kannauj      30
Kanpur       30
Lucknow      25
Bhopal       25
Aligarh      21
Jaunpur      18
Nagpur       18
Allahabad    18
Name: Address, dtype: int64

Next we split the date into separate columns (year, month, day of month and weekday) to check whether the missing values cluster around particular dates.

In [177]:
df['start_year'] = df.Date.dt.year
df['start_month'] = df.Date.dt.month_name()
df['start_date'] = df.Date.dt.day
df['week_day']  = df.Date.dt.day_name()

In [178]:
df.loc[df['Age'].isnull(), 'start_year'].value_counts()

2021    85
2020    80
2022    80
2023     5
Name: start_year, dtype: int64

In [179]:
df.loc[df['Age'].isnull(), 'start_month'].value_counts()

April        27
August       26
June         25
May          24
July         24
January      23
February     20
December     20
March        18
November     16
October      14
September    13
Name: start_month, dtype: int64

In [180]:
df.loc[df['Age'].isnull(), 'start_date'].value_counts()

21    13
22    13
23    12
26    12
14    11
5     10
28    10
11     9
19     9
12     9
9      9
1      9
10     9
25     8
17     8
15     8
29     8
20     8
16     8
4      8
18     7
8      6
6      6
24     6
31     5
2      5
27     5
7      5
3      5
13     5
30     4
Name: start_date, dtype: int64

In [181]:
df.loc[df['Age'].isnull(), 'week_day'].value_counts()

Sunday       46
Wednesday    40
Monday       38
Tuesday      38
Friday       30
Thursday     29
Saturday     29
Name: week_day, dtype: int64
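
The counts above depend on how many rows each group has in total, so a large group will show more missing values even when nulls are spread evenly. As a complementary check (a sketch on made-up data, not part of the original analysis), the share of missing values per group can be computed with a group-wise mean:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing Age on each weekday, but Monday has twice as many rows
toy = pd.DataFrame({
    "week_day": ["Sunday", "Sunday", "Monday", "Monday", "Monday", "Monday"],
    "Age": [np.nan, 30.0, np.nan, 41.0, 25.0, 52.0],
})

# isnull() gives booleans; their mean within each group is the share of missing values
null_rate = toy["Age"].isnull().groupby(toy["week_day"]).mean()

print(null_rate)
```

Here both weekdays have the same count of missing values, yet the rate is twice as high on Sunday. Roughly equal rates across groups support the idea that the values are missing at random.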

Since the missing values don't seem to be connected to any of the other columns, we can fill them by taking the mean of the two closest non-null values in the same column, where "closest" refers to the position of the rows in the index.

In [190]:
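def fill_na_mean_closest_two(col):
    """
    Fills missing values in the column by taking the mean of the two closest non-null values.
    Closeness is measured on the index labels, which must be numeric and sorted.
    """
    not_null = col.notnull().to_numpy()
    idx = col.index[not_null].to_numpy()
    val = col[not_null].to_numpy()

    filled = col.copy()
    # Series.iteritems was removed in pandas 2.0, so iterate over the missing labels directly
    for i in col.index[~not_null]:
        # Position in idx where label i would be inserted to keep idx sorted
        j = np.searchsorted(idx, i, side='left')
        # The two closest labels lie among the two non-null neighbours on each side;
        # clipping the window avoids wrapping around at either end of the column
        lo, hi = max(j - 2, 0), min(j + 2, len(idx))
        closest = np.argsort(np.abs(idx[lo:hi] - i), kind='stable')[:2]
        filled.loc[i] = val[lo:hi][closest].mean()
    return filled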

In [191]:
df['Age'] = fill_na_mean_closest_two(df['Age'])
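
For comparison, a closely related fill can be sketched with built-in pandas methods only. Note that this is a different technique from the function above: it averages the nearest non-null value before and after each gap, rather than picking the two nearest values by distance. The sample series is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one gap in the middle and one at the end
s = pd.Series([44.0, np.nan, 48.0, 58.0, np.nan])

prev_val = s.ffill()   # last non-null value at or before each position
next_val = s.bfill()   # first non-null value at or after each position

# The row-wise mean skips NaN, so a gap at either end falls back to its single neighbour
neighbour_mean = pd.concat([prev_val, next_val], axis=1).mean(axis=1)
filled = s.fillna(neighbour_mean)

print(filled.tolist())
```

The middle gap is filled with the mean of 44 and 48, and the trailing gap, which has no following value, takes the preceding value 58.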

In [192]:
df.isnull().sum()

Name               0
Age                0
Address            0
Qualification      0
Score            250
Date               0
start_year         0
start_month        0
start_date         0
week_day           0
dtype: int64

(b) Score

Let's repeat the same method for the Score column.

In [194]:
df.loc[df['Score'].isnull(), 'Qualification'].value_counts()

Unknown    44
Msc        32
B.com      31
MA         28
Phd        28
B.Tech     25
B.A        23
Diploma    20
MCA        19
Name: Qualification, dtype: int64

In [103]:
df.loc[df['Score'].isnull(), 'Name'].value_counts()

Unknown    47
Gaurav     25
Deepak     20
Manish     19
Ankit      19
Princi     19
Anuj       18
Ravi       18
Sunil      18
Suresh     17
Praveen    17
Jai        17
Rahul      17
Sachin     16
Vijay      16
Saurabh    16
Vikram     14
Rajesh     14
Vinod      13
Tarun      12
Mukesh     12
Abhi       12
Kamal      10
Sandeep     9
Amit        9
Name: Name, dtype: int64

In [105]:
df.loc[df['Score'].isnull(), 'Address'].value_counts()

Unknown      85
Aligarh      50
Nagpur       46
Allahabad    45
Lucknow      43
Bhopal       42
Kanpur       38
Kannauj      38
Jaunpur      37
Name: Address, dtype: int64

In [156]:
df.loc[df['Score'].isnull(), 'start_year'].value_counts()

2020    92
2022    82
2021    74
2023     2
Name: start_year, dtype: int64

In [158]:
df.loc[df['Score'].isnull(), 'start_month'].value_counts()

March        29
January      26
August       24
April        22
May          20
February     20
November     20
October      20
July         19
June         19
September    17
December     14
Name: start_month, dtype: int64

In [160]:
df.loc[df['Score'].isnull(), 'start_date'].value_counts()

19    15
17    14
22    13
30    12
8     11
20    11
26    11
9     10
13    10
6     10
3     10
5     10
27     9
11     9
12     9
18     8
23     8
24     8
15     7
10     7
7      7
25     7
14     7
21     7
2      4
29     4
4      3
28     3
16     3
1      3
Name: start_date, dtype: int64

In [162]:
df.loc[df['Score'].isnull(), 'week_day'].value_counts()

Sunday       48
Thursday     35
Friday       35
Saturday     35
Wednesday    34
Monday       33
Tuesday      30
Name: week_day, dtype: int64

In [195]:
df['Score'] = fill_na_mean_closest_two(df['Score'])

In [196]:
df.isnull().sum()

Name             0
Age              0
Address          0
Qualification    0
Score            0
Date             0
start_year       0
start_month      0
start_date       0
week_day         0
dtype: int64

# Outlier detection

The code below checks a single column of the dataframe for outliers using the Z-score, a standardized measure of how many standard deviations a value lies from the column mean. Rows whose absolute Z-score exceeds 3 are flagged as outliers.

Outlier detection is an important step in data cleaning and preparation. Outliers can significantly affect the results of statistical analysis, so it's important to identify and handle them appropriately.

In [197]:
mean = df['Score'].mean()
std = df['Score'].std()

z_scores = (df['Score'] - mean) / std
outliers = df[np.abs(z_scores) > 3]

print(outliers)

Empty DataFrame
Columns: [Name, Age, Address, Qualification, Score, Date, start_year, start_month, start_date, week_day]
Index: []


In [198]:
mean = df['Age'].mean()
std = df['Age'].std()

z_scores = (df['Age'] - mean) / std
outliers = df[np.abs(z_scores) > 3]

print(outliers)

Empty DataFrame
Columns: [Name, Age, Address, Qualification, Score, Date, start_year, start_month, start_date, week_day]
Index: []
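
Both checks return an empty frame, which is expected here: values drawn evenly from a bounded range never stray far from the mean in standard-deviation terms (for an evenly spread range the largest Z-score is roughly 1.7). To see the check flag something, the sketch below wraps the same calculation in a helper, `zscore_outliers` (a name introduced here for illustration), and plants one extreme value in made-up data:

```python
import pandas as pd

def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return the entries whose absolute Z-score exceeds the threshold."""
    # pandas std() uses the sample standard deviation (ddof=1), as in the cells above
    z = (values - values.mean()) / values.std()
    return values[z.abs() > threshold]

# Hypothetical scores: 50 ordinary values plus one planted extreme value
scores = pd.Series(list(range(50, 100)) + [1000])
flagged = zscore_outliers(scores)

# Evenly spread values in a bounded range, similar to Score in this dataset
bounded = pd.Series(range(20, 101))
none_flagged = zscore_outliers(bounded)

print(flagged)
print(none_flagged.empty)
```

Only the planted value is flagged in the first series, and nothing is flagged in the bounded one.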


# Summary

After thoroughly reviewing the data, we discovered that there were duplicates, missing values, and incorrect data types present. In order to enhance the quality and reliability of the data, we conducted both Data Exploration and Data Cleaning. The outcome of these processes has resulted in a cleaned dataset, which can now be used for further analysis or modeling. Here is the cleaned data ready for the next steps.

In [199]:
df.head()

      Name   Age    Address  Qualification  Score        Date  start_year  start_month  start_date   week_day
0      Jai  44.0     Nagpur            MCA   31.0  2022-05-11        2022          May          11  Wednesday
1  Unknown  46.0    Unknown          B.com   59.5  2020-10-21        2020      October          21  Wednesday
2   Princi  48.0    Lucknow            Msc   88.0  2022-12-27        2022     December          27    Tuesday
3   Gaurav  58.0    Lucknow            Msc   77.0  2021-06-05        2021         June           5   Saturday
4   Mukesh  26.0  Allahabad        Diploma   81.0  2021-09-05        2021    September           5     Sunday
