# Creating The Artificial Dataset

In [40]:
import random
import pandas as pd
import numpy as np
import warnings 
warnings.filterwarnings('ignore')
from datetime import datetime, timedelta

# Add The Columns
names = ['Jai', 'Anuj', 'Princi', 'Gaurav', 'Abhi', 'Ravi', 'Amit',
         'Rahul', 'Kamal', 'Vikram', 'Sachin', 'Ankit', 'Mukesh', 'Saurabh', 
         'Rajesh', 'Suresh', 'Praveen', 'Vijay', 'Sandeep', 'Sunil', 'Deepak',
         'Vinod', 'Manish', 'Tarun',None,None]
addresses = ['Nagpur', 'Kanpur', 'Allahabad', 'Kannauj', 'Jaunpur', 'Aligarh',
             'Lucknow', 'Bhopal',None,None]
qualifications = ['Msc', 'MA', 'MCA', 'Phd', 'B.Tech', 'B.com', 'B.A',
                  'Diploma',None,None]
data1 = {'Name': [], 'Age': [], 'Address': [], 'Qualification': [], 'Score': []}

for i in range(2000):
    data1['Name'].append(random.choice(names))
    data1['Age'].append(random.randint(22, 60))
    data1['Address'].append(random.choice(addresses))
    data1['Qualification'].append(random.choice(qualifications))
    data1['Score'].append(random.randint(20, 100))

    
       
df = pd.DataFrame(data1)
   
# Adding A Date column
start_date = datetime(2020, 1, 1)
end_date = datetime(2023, 2, 4)
time_between_dates = end_date - start_date
days_between_dates = time_between_dates.days
df['Date'] = [start_date + timedelta(days=random.randint(0, days_between_dates)) for i in range(df.shape[0])]

# Changing the Data Type
df['Date'] = df['Date'].astype(str)
df['Age'] = df['Age'].astype(str)
df['Score'] = df['Score'].astype(str)

# Add The Duplicated Data
duplicated_rows = df.loc[random.sample(list(range(2000)), 500)].copy()
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
df = pd.concat([df, duplicated_rows], ignore_index=True)
rows = list(range(df.shape[0]))


# Add The Null Data
rows = random.sample(rows, int(df.shape[0] * 0.05))

columns = random.sample(list(df.columns), int(df.shape[1] * 0.2))

for column in columns:
    df.loc[rows, column] = np.nan
    
    
    
# Shuffle The Data
df = df.sample(frac=1, random_state=0).reset_index(drop=True)


# Final Data
print(df)

         Name Age  Address Qualification Score        Date
0      Mukesh  58  Aligarh           B.A    21  2020-07-29
1      Sachin  46  Aligarh           Phd    74  2022-07-20
2         Jai  40   Nagpur           Msc    37  2020-04-29
3      Manish  36     None        B.Tech    45  2021-01-04
4     Praveen  27     None        B.Tech    43  2020-06-20
...       ...  ..      ...           ...   ...         ...
2495     Ravi  56     None          None    57  2022-03-08
2496     None  27  Jaunpur         B.com    28         NaN
2497    Ankit  32   Kanpur          None    60         NaN
2498      Jai  40   Nagpur           B.A    64  2020-07-21
2499      Jai  47  Jaunpur           B.A    85  2021-06-25

[2500 rows x 6 columns]


We have generated this dataset containing duplicate and missing values for the purpose of demonstrating the significance of data cleaning and improving our understanding of the process.
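The generation code above draws from `random` without a fixed seed, so rerunning it produces different rows than the output shown. A minimal sketch of how seeding could make the dataset reproducible follows; the seed value and variable names are assumptions made for this illustration, not part of the original notebook.

```python
import random
import numpy as np

SEED = 42  # arbitrary value chosen for this sketch

# Seeding the stdlib generator makes random.choice / random.randint repeatable.
random.seed(SEED)
first = [random.randint(22, 60) for _ in range(5)]

random.seed(SEED)
second = [random.randint(22, 60) for _ in range(5)]

print(first == second)  # True: same seed, same draws

# NumPy has its own generator; a seeded Generator object keeps it isolated.
rng = np.random.default_rng(SEED)
rows = rng.choice(10, size=3, replace=False)
print(rows)
```

Calling `random.seed` once at the top of the generation cell would be enough for the `random`-based steps; the NumPy draws used later would need their own seed.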

# Data Exploration and Data Cleaning

Data Exploration is the process of analyzing and summarizing the characteristics and patterns of a dataset in order to understand the data and identify issues that need to be addressed before further analysis or modeling.

Data Cleaning is the process of correcting or removing errors, inconsistencies, and inaccuracies in a dataset in order to improve the quality and reliability of the data for analysis and modeling.

### Checking Anomalies in Data Statistics and Types

The **describe method** can be useful for quickly getting an overview of the distribution of your data and detecting any outliers or anomalies. It can also help identify any missing or non-numeric values in your data.

In [41]:
df.describe(include='all')

         Name   Age  Address Qualification Score        Date
count    2290  2500     1949          2001  2500        2375
unique     24    39        8             8    81         907
top     Ankit    28  Jaunpur         B.com    87  2021-01-31
freq      118    84      259           274    50          10


The info method in pandas is used to obtain a concise summary of a dataframe, including the number of non-missing values in each column, the data type of each column, and the memory usage of the dataframe.

In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Name           2290 non-null   object
 1   Age            2500 non-null   object
 2   Address        1949 non-null   object
 3   Qualification  2001 non-null   object
 4   Score          2500 non-null   object
 5   Date           2375 non-null   object
dtypes: object(6)
memory usage: 117.3+ KB


The `Dtype` column shows that the numeric columns (Age, Score) and the Date column are all stored as strings (`object`). For proper analysis and manipulation, these columns need to be converted to numeric and datetime types.

### Data Type Conversion

In [43]:
df['Score'] = df['Score'].astype(int)
df['Age'] = df['Age'].astype(float)
df['Date'] = pd.to_datetime(df['Date'])

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2500 entries, 0 to 2499
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Name           2290 non-null   object        
 1   Age            2500 non-null   float64       
 2   Address        1949 non-null   object        
 3   Qualification  2001 non-null   object        
 4   Score          2500 non-null   int32         
 5   Date           2375 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int32(1), object(3)
memory usage: 107.5+ KB
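The conversion above works because every string in this dataset is well formed. As a hedged sketch of a more defensive variant, `pd.to_numeric` and `pd.to_datetime` accept `errors='coerce'`, which turns unparseable entries into `NaN` / `NaT` instead of raising. The toy frame and its bad entries below are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the tutorial's df; the bad entries are invented.
toy = pd.DataFrame({
    'Age': ['25', '31', 'n/a'],
    'Score': ['88', '', '47'],
    'Date': ['2021-03-01', 'not a date', '2022-11-15'],
})

# errors='coerce' replaces anything that cannot be parsed with NaN / NaT.
toy['Age'] = pd.to_numeric(toy['Age'], errors='coerce')
toy['Score'] = pd.to_numeric(toy['Score'], errors='coerce')
toy['Date'] = pd.to_datetime(toy['Date'], errors='coerce')

print(toy.dtypes)
print(toy.isna().sum())
```

The coerced entries then show up in the null counts, so they can be handled together with the other missing values.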


### Checking For Duplicated Rows

The duplicated method in pandas is used to check for duplicate rows in a dataframe. The method returns a boolean series with True values for the rows that are duplicates and False values for the unique rows.

The value_counts method can then be applied to the resulting series to get a count of the unique values, i.e., the number of duplicate and unique rows in the dataframe.

In [45]:
df.duplicated().value_counts()

False    2050
True      450
dtype: int64

Whether to keep or drop duplicates depends on the goal of the analysis: sometimes repeated records are legitimate and must be kept for accuracy, and sometimes they must be dropped to avoid skewed results.
In this case we drop the duplicated rows.

In [46]:
df.drop_duplicates(inplace=True)
df.duplicated().value_counts()

False    2050
dtype: int64
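To make the behaviour of `duplicated` more concrete, here is a small sketch on a made-up four-row frame. It shows the default `keep='first'`, the `keep=False` variant that flags every member of a duplicate group, and the `subset` parameter that limits the comparison to chosen columns.

```python
import pandas as pd

# Invented rows: the first two are fully identical.
toy = pd.DataFrame({
    'Name': ['Jai', 'Jai', 'Ravi', 'Jai'],
    'Score': [64, 64, 57, 85],
})

# Default keep='first': only later copies of an identical row are flagged.
print(toy.duplicated().tolist())                 # [False, True, False, False]

# keep=False flags every row that belongs to a duplicate group.
print(toy.duplicated(keep=False).tolist())       # [True, True, False, False]

# subset compares only the listed columns.
print(toy.duplicated(subset=['Name']).tolist())  # [False, True, False, True]

# drop_duplicates keeps the first copy; reset_index gives a clean 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped))                              # 3
```

Note that `drop_duplicates` keeps the original index labels unless the index is reset afterwards.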

### Checking for Null Values

The **isnull** method in pandas is used to check for missing values in a dataframe. The method returns a dataframe of the same shape as the original dataframe, with True values in the cells where the corresponding value in the original dataframe is missing (i.e., NaN or None), and False values in the cells where the corresponding value is not missing.

The **sum** method can then be applied to the resulting dataframe to calculate the number of missing values in each column.

To demonstrate null handling on numeric columns, we first inject missing values into the Age and Score columns of 250 randomly chosen rows.

In [47]:
rows = np.random.choice(df.index, size=250, replace=False)
df.loc[rows, 'Age'] = np.nan
df.loc[rows, 'Score'] = np.nan

In [48]:
df.isnull().sum()

Name             173
Age              250
Address          454
Qualification    401
Score            250
Date             124
dtype: int64
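Raw counts are easier to judge next to the share of rows they represent. A minimal sketch on a made-up frame: since `isnull()` returns booleans, taking the column-wise `mean` gives the fraction of missing values directly.

```python
import numpy as np
import pandas as pd

# Invented data with a few gaps.
toy = pd.DataFrame({
    'Age': [25.0, np.nan, 40.0, np.nan],
    'Address': ['Nagpur', None, 'Kanpur', 'Bhopal'],
})

missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()  # mean of booleans = fraction of True

summary = pd.DataFrame({'missing': missing_count, 'share': missing_share})
print(summary)
```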

### Handling missing values

1. For the categorical data

In [49]:
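# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated in recent pandas versions.
df['Name'] = df['Name'].fillna('Unknown')
df['Address'] = df['Address'].fillna('Unknown')
df['Qualification'] = df['Qualification'].fillna('Unknown')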
df.isnull().sum()

Name               0
Age              250
Address            0
Qualification      0
Score            250
Date             124
dtype: int64

2. For the numerical data

(a) Age

In [50]:
df.loc[df['Age'].isnull() ==True,'Qualification'].value_counts()

Unknown    42
B.com      30
Msc        29
MCA        28
MA         28
B.A        26
Phd        24
B.Tech     22
Diploma    21
Name: Qualification, dtype: int64

In [51]:
df.loc[df['Age'].isnull() ==True,'Name'].value_counts()

Suresh     17
Ankit      16
Unknown    16
Vijay      16
Princi     13
Amit       13
Kamal      12
Rahul      12
Manish     12
Jai        11
Saurabh    11
Sandeep    10
Praveen    10
Vinod       9
Deepak      8
Anuj        8
Gaurav      8
Rajesh      7
Vikram      7
Abhi        7
Tarun       6
Sachin      6
Sunil       5
Ravi        5
Mukesh      5
Name: Name, dtype: int64

In [52]:
df.loc[df['Age'].isnull() ==True,'Address'].value_counts()

Unknown      48
Kannauj      36
Allahabad    30
Nagpur       27
Aligarh      27
Kanpur       25
Lucknow      22
Bhopal       18
Jaunpur      17
Name: Address, dtype: int64

Next, we split the date into separate year, month, day, and weekday columns to check whether the null values vary with any part of the date.

In [53]:
df['start_year'] = df.Date.dt.year
df['start_month'] = df.Date.dt.month_name()
df['start_date'] = df.Date.dt.day
df['week_day']  = df.Date.dt.day_name()

In [54]:
df.loc[df['Age'].isnull() ==True,'start_year'].value_counts()

2020.0    84
2021.0    74
2022.0    65
2023.0    11
Name: start_year, dtype: int64

In [55]:
df.loc[df['Age'].isnull() ==True,'start_month'].value_counts()

January      38
August       26
March        23
May          19
June         18
October      18
April        17
July         17
December     16
September    16
February     14
November     12
Name: start_month, dtype: int64

In [56]:
df.loc[df['Age'].isnull() ==True,'start_date'].value_counts()

5.0     15
10.0    14
4.0     12
22.0    12
2.0     11
21.0    11
9.0     11
13.0    11
29.0     9
14.0     8
20.0     8
1.0      7
12.0     7
23.0     7
7.0      7
25.0     7
3.0      7
18.0     7
28.0     7
11.0     6
16.0     6
6.0      5
15.0     5
26.0     5
30.0     5
27.0     4
19.0     4
8.0      4
31.0     4
17.0     4
24.0     4
Name: start_date, dtype: int64

In [57]:
df.loc[df['Age'].isnull() ==True,'week_day'].value_counts()

Monday       41
Friday       37
Tuesday      36
Wednesday    35
Saturday     32
Sunday       29
Thursday     24
Name: week_day, dtype: int64
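The breakdowns above are raw counts, which also reflect how common each category is. As a hedged complement, the missing rate per group can be compared instead: roughly equal rates across groups are consistent with the nulls being unrelated to that column. The toy data below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented data: each qualification has one missing Age out of two rows.
toy = pd.DataFrame({
    'Qualification': ['Msc', 'Msc', 'Phd', 'Phd', 'MA', 'MA'],
    'Age': [30.0, np.nan, 41.0, np.nan, np.nan, 52.0],
})

# Share of missing Age values within each Qualification group.
missing_rate = toy['Age'].isnull().groupby(toy['Qualification']).mean()
print(missing_rate)
```

This is only a descriptive check; it does not replace a formal test of independence.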

Since the missing values don't seem to be connected to any of the other columns, we can fill them by taking the mean of the two closest non-null values in the same column.

In [58]:
import math

def fill_na_mean_closest_two(col):
    """
    Fills missing values in the column by taking the mean of two neighbouring
    non-null values located around the missing position (by index label).
    """
    not_null = col.notnull()
    idx = col.index[not_null]
    val = col.loc[not_null].values

    filled = col.copy()
    # Series.iteritems() was removed in pandas 2.0; items() is the replacement.
    for i, _ in col[~not_null].items():
        j = np.searchsorted(idx, i, side='left')
        if j > 0 and (j == len(idx) or math.fabs(i - idx[j-1]) < math.fabs(i - idx[j])):
            j -= 1
        # Guard against j == 0, where val[j-1] would wrap around to the last value.
        j = max(j, 1)
        filled.loc[i] = (val[j-1] + val[j]) / 2
    return filled

In [59]:
df['Age'] = fill_na_mean_closest_two(df['Age'])
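For comparison, here is a sketch of a loop-free alternative built from `ffill` and `bfill`. It is a different rule from `fill_na_mean_closest_two`: it averages the nearest non-null value before and the nearest one after each gap, so its results can differ from the function above. The sample values are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([58.0, 46.0, np.nan, np.nan, 27.0])

# Nearest non-null value before and after each position.
prev_val = s.ffill()
next_val = s.bfill()

# Average the two sides; at the edges only one side exists, so fall back to it.
neighbour_mean = ((prev_val + next_val) / 2).fillna(prev_val).fillna(next_val)

filled = s.fillna(neighbour_mean)
print(filled.tolist())  # [58.0, 46.0, 36.5, 36.5, 27.0]
```

Because it works position by position rather than by index label, this variant treats consecutive gaps identically, which may or may not be desirable depending on the data.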

In [60]:
df.isnull().sum()

Name               0
Age                0
Address            0
Qualification      0
Score            250
Date             124
start_year       124
start_month      124
start_date       124
week_day         124
dtype: int64

(b) Score

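Let's repeat the same check for the Score column. Because the nulls were injected into the same 250 rows for both Age and Score, the breakdowns below are identical to those for Age.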

In [62]:
df.loc[df['Score'].isnull() ==True,'Qualification'].value_counts()

Unknown    42
B.com      30
Msc        29
MCA        28
MA         28
B.A        26
Phd        24
B.Tech     22
Diploma    21
Name: Qualification, dtype: int64

In [63]:
df.loc[df['Score'].isnull() ==True,'Name'].value_counts()

Suresh     17
Ankit      16
Unknown    16
Vijay      16
Princi     13
Amit       13
Kamal      12
Rahul      12
Manish     12
Jai        11
Saurabh    11
Sandeep    10
Praveen    10
Vinod       9
Deepak      8
Anuj        8
Gaurav      8
Rajesh      7
Vikram      7
Abhi        7
Tarun       6
Sachin      6
Sunil       5
Ravi        5
Mukesh      5
Name: Name, dtype: int64

In [64]:
df.loc[df['Score'].isnull() ==True,'Address'].value_counts()

Unknown      48
Kannauj      36
Allahabad    30
Nagpur       27
Aligarh      27
Kanpur       25
Lucknow      22
Bhopal       18
Jaunpur      17
Name: Address, dtype: int64

In [65]:
df.loc[df['Score'].isnull() ==True,'start_year'].value_counts()

2020.0    84
2021.0    74
2022.0    65
2023.0    11
Name: start_year, dtype: int64

In [66]:
df.loc[df['Score'].isnull() ==True,'start_month'].value_counts()

January      38
August       26
March        23
May          19
June         18
October      18
April        17
July         17
December     16
September    16
February     14
November     12
Name: start_month, dtype: int64

In [67]:
df.loc[df['Score'].isnull() ==True,'start_date'].value_counts()

5.0     15
10.0    14
4.0     12
22.0    12
2.0     11
21.0    11
9.0     11
13.0    11
29.0     9
14.0     8
20.0     8
1.0      7
12.0     7
23.0     7
7.0      7
25.0     7
3.0      7
18.0     7
28.0     7
11.0     6
16.0     6
6.0      5
15.0     5
26.0     5
30.0     5
27.0     4
19.0     4
8.0      4
31.0     4
17.0     4
24.0     4
Name: start_date, dtype: int64

In [68]:
df.loc[df['Score'].isnull() ==True,'week_day'].value_counts()

Monday       41
Friday       37
Tuesday      36
Wednesday    35
Saturday     32
Sunday       29
Thursday     24
Name: week_day, dtype: int64

In [69]:
df['Score'] = fill_na_mean_closest_two(df['Score'])

In [70]:
df.isnull().sum()

Name               0
Age                0
Address            0
Qualification      0
Score              0
Date             124
start_year       124
start_month      124
start_date       124
week_day         124
dtype: int64

### Forward-filling The Dates

In [None]:
# Forward-fill each column with its last known value; assigning back avoids
# in-place modification of a column selection.
for col in ['Date', 'start_year', 'start_month', 'start_date', 'week_day']:
    df[col] = df[col].ffill()
df.info()
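A minimal sketch of forward-filling on a made-up frame, using `Series.ffill()`. It also shows one possible design choice that differs from the cell above: deriving the date parts after the fill, so that they always agree with the filled `Date` value.

```python
import pandas as pd

# Invented dates with two gaps (None becomes NaT).
toy = pd.DataFrame({
    'Date': pd.to_datetime(['2020-07-29', None, '2022-07-20', None]),
})

# Carry the last known date forward into the gaps.
toy['Date'] = toy['Date'].ffill()

# Derive the parts after filling so they stay consistent with Date.
toy['start_year'] = toy['Date'].dt.year
toy['week_day'] = toy['Date'].dt.day_name()
print(toy)
```

Forward-filling cannot fill a gap in the very first row, since there is no earlier value to carry forward.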

### Outlier detection

Outlier detection is an important step in data cleaning and preparation. Outliers can significantly affect the results of statistical analysis, so it's important to identify and handle them appropriately.

The cells below detect outliers in a single column using the Z-score, a standardized measure of how many standard deviations a value lies from the column mean. Rows whose absolute Z-score exceeds 3 are treated as outliers.

In [71]:
mean = df['Score'].mean()
std = df['Score'].std()

z_scores = (df['Score'] - mean) / std
outliers = df[np.abs(z_scores) > 3]

print(outliers)

Empty DataFrame
Columns: [Name, Age, Address, Qualification, Score, Date, start_year, start_month, start_date, week_day]
Index: []


In [72]:
mean = df['Age'].mean()
std = df['Age'].std()

z_scores = (df['Age'] - mean) / std
outliers = df[np.abs(z_scores) > 3]

print(outliers)

Empty DataFrame
Columns: [Name, Age, Address, Qualification, Score, Date, start_year, start_month, start_date, week_day]
Index: []
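Both checks come back empty, which is expected for values drawn uniformly from a fixed range. To see the rule flag something, here is a sketch that wraps the same computation in a helper and runs it on made-up values with one extreme entry. The function name `zscore_outliers` and its `threshold` parameter are hypothetical, introduced only for this example.

```python
import numpy as np
import pandas as pd

def zscore_outliers(series, threshold=3.0):
    """Return the entries whose absolute Z-score exceeds `threshold`."""
    z = (series - series.mean()) / series.std()
    return series[np.abs(z) > threshold]

# 20 typical values plus one extreme value (all invented for illustration).
values = pd.Series([50.0] * 10 + [52.0] * 10 + [500.0])

flagged = zscore_outliers(values)
print(flagged)
```

Only the extreme entry is returned here. On very small samples the Z-score can never exceed 3, so the threshold may need to be lowered or another rule used.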


# Summary

After reviewing the data, we found duplicated rows, missing values, incorrect data types, and empty entries in the Date column. To improve the quality and reliability of the data, we carried out both Data Exploration and Data Cleaning. The result is a cleaned dataset that can now be used for further analysis or modeling. Here is the cleaned data, ready for the next steps.

In [79]:
df.head()

      Name   Age  Address Qualification  Score        Date  start_year start_month  start_date   week_day
0   Mukesh  58.0  Aligarh           B.A   21.0  2020-07-29      2020.0        July        29.0  Wednesday
1   Sachin  46.0  Aligarh           Phd   74.0  2022-07-20      2022.0        July        20.0  Wednesday
2      Jai  52.0   Nagpur           Msc   47.5  2020-04-29      2020.0       April        29.0  Wednesday
3   Manish  36.5  Unknown        B.Tech   58.5  2021-01-04      2021.0     January         4.0     Monday
4  Praveen  27.0  Unknown        B.Tech   43.0  2020-06-20      2020.0        June        20.0   Saturday


In [80]:
df.describe()

             Age      Score   start_year  start_date
count     2050.0     2050.0       2050.0      2050.0
mean   41.377073  60.227805  2021.047805   15.654146
std    11.047582  22.415593     0.877381    8.877612
min         22.0       20.0       2020.0         1.0
25%         32.0     41.125       2020.0         8.0
50%         41.5       61.0       2021.0        15.0
75%         51.0       79.0       2022.0        23.0
max         60.0      100.0       2023.0        31.0
