# Data Inconsistencies

Data inconsistencies refer to errors or discrepancies in data that can affect the accuracy and reliability of data analysis, and may include issues such as missing values, outliers, duplicates, and formatting errors.

1. Inconsistent Formats
   - Different regions write dates in different formats, such as DD-MM-YYYY, DD/MM/YYYY, or YYYY-MM-DD.
   - We convert the different formats into a single format.
2. Inconsistent Naming Conventions
   - USA, United States, US, or United States of America.
   - We convert the inconsistent names to one consistent form.
3. Typographical Errors
   - Mistakes in data entry.
   - Pakistan -> Paakistan or pakistan.
4. Duplication
5. Contradictory Data
   - Suppose we have `son_age` and `father_age`. Logically, `son_age < father_age`, so a row where `son_age > father_age` is contradictory: it is impossible, or at least very rare.
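
The contradiction check described in item 5 can be sketched as below. The `families` DataFrame and its values are made up for illustration; only the `son_age` and `father_age` column names come from the text above.

```python
import pandas as pd

# Hypothetical records, invented for this sketch
families = pd.DataFrame({
    "son_age": [10, 45, 30],
    "father_age": [40, 35, 60],
})

# A row is contradictory when the son is not younger than the father
families["is_contradictory"] = families["son_age"] >= families["father_age"]

# Keep the suspicious rows aside for review instead of deleting them outright
contradictions = families[families["is_contradictory"]]
print(contradictions)
```

Flagging first and deciding later is a design choice: some contradictions come from swapped columns and can be repaired rather than dropped.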

In [1]:
# importing library
import pandas as pd

In [19]:
data = {
    'date': ['2021-12-01', '01-12-2022', '2022/12/01', '12-01-2021'],
    'country': ['USA', 'U.S.A.', 'America', 'United States'],
    'name': ['Aammar', 'Amaar', 'Hamza', 'Hazma'],
    'sales_2020': [100, 200, None, 200],
    'sales_2021': [None, 150, 300, 150]
}
# make pandas dataframe
df = pd.DataFrame(data)
df.head()

Unnamed: 0,date,country,name,sales_2020,sales_2021
0,2021-12-01,USA,Aammar,100.0,
1,01-12-2022,U.S.A.,Amaar,200.0,150.0
2,2022/12/01,America,Hamza,,300.0
3,12-01-2021,United States,Hazma,200.0,150.0


In [9]:
# standardizing the date format
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['date'] = df['date'].dt.strftime("%Y-%m-%d")
df.head()

Unnamed: 0,date,country,name,sales_2020,sales_2021
0,2021-12-01,USA,Aammar,100.0,
1,,U.S.A.,Amaar,200.0,150.0
2,,America,Hamza,,300.0
3,,United States,Hazma,200.0,150.0


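A single `pd.to_datetime` call could not handle the mixed formats: only the first date was parsed, and the other three were coerced to missing values. Either impute the missing dates or try another method to resolve the date format inconsistencies.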

In [13]:
# # fill NaT values with the first day of the year based on the year of the NaT value
# df['date'] = df['date'].fillna(df['date'].dt.year.apply(lambda x: pd.to_datetime(f"{x}-01-01", format="%Y-%m-%d")))

# # convert date column to desired format
# df['date'] = df['date'].dt.strftime("%Y-%m-%d")
# df.head()
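
One possible alternative, sketched here rather than taken from the original notebook, is to try a list of explicit formats in turn and keep the first one that parses each value. The `formats` list is an assumption. In particular, reading `NN-NN-YYYY` as day-month-year is a guess, because the strings alone cannot settle whether `01-12-2022` means 1 December or 12 January.

```python
import pandas as pd

# Same raw strings as the 'date' column in the example data
raw = pd.Series(['2021-12-01', '01-12-2022', '2022/12/01', '12-01-2021'])

# Candidate formats, tried in order (assumed; adjust to the real data source)
formats = ["%Y-%m-%d", "%Y/%m/%d", "%d-%m-%Y"]

# Parse with the first format, then fill the remaining gaps with later ones
parsed = pd.to_datetime(raw, format=formats[0], errors="coerce")
for fmt in formats[1:]:
    attempt = pd.to_datetime(raw, format=fmt, errors="coerce")
    parsed = parsed.fillna(attempt)

# Write every date back out in one standard format
standardized = parsed.dt.strftime("%Y-%m-%d")
print(standardized.tolist())
```

Under the day-first assumption, `12-01-2021` becomes 12 January 2021. If the source actually uses month-first dates, swap `%d-%m-%Y` for `%m-%d-%Y`.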

In [14]:
# Harmonize the country names
country_mapping = {'USA': 'United States', 'U.S.A.': 'United States', 'America': 'United States'}
df['country'] = df['country'].replace(country_mapping)
df.head()

Unnamed: 0,date,country,name,sales_2020,sales_2021
0,2021-12-01,United States,Aammar,100.0,
1,NaT,United States,Amaar,200.0,150.0
2,NaT,United States,Hamza,,300.0
3,NaT,United States,Hazma,200.0,150.0
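
An exact-match mapping misses variants that differ only in case, spacing, or punctuation. A minimal sketch of normalizing the text before the lookup follows. The `canonical` dictionary and the extra `' usa '` value are assumptions added for illustration.

```python
import pandas as pd

countries = pd.Series(['USA', 'U.S.A.', 'America', 'United States', ' usa '])

# Normalize: trim whitespace, lowercase, and drop periods
keys = countries.str.strip().str.lower().str.replace(".", "", regex=False)

# Lookup table keyed by the normalized spelling (assumed list of variants)
canonical = {
    "usa": "United States",
    "us": "United States",
    "america": "United States",
    "united states": "United States",
    "united states of america": "United States",
}

# Values missing from the table fall back to their original spelling
harmonized = keys.map(canonical).fillna(countries)
print(harmonized.tolist())
```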


In [15]:
# Correct the typographical mistakes in name
# Map each misspelled variant to its intended spelling
df['name'] = df['name'].replace({'Amaar': 'Aammar', 'Hazma': 'Hamza'})
df.head()

Unnamed: 0,date,country,name,sales_2020,sales_2021
0,2021-12-01,United States,Aammar,100.0,
1,NaT,United States,Aammar,200.0,150.0
2,NaT,United States,Hamza,,300.0
3,NaT,United States,Hamza,200.0,150.0
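
Writing the replacement dictionary by hand does not scale to many names. When a list of valid spellings is available, the standard library's `difflib.get_close_matches` can suggest the nearest one. This is only a sketch: `known_names` and the helper `closest_name` are hypothetical, and fuzzy matches should be reviewed before they are trusted.

```python
import difflib
import pandas as pd

names = pd.Series(['Aammar', 'Amaar', 'Hamza', 'Hazma'])

# Assumed reference list of correct spellings
known_names = ['Aammar', 'Hamza']

def closest_name(value, choices):
    """Return the most similar known spelling, or the value itself if none is close."""
    matches = difflib.get_close_matches(value, choices, n=1)
    return matches[0] if matches else value

corrected = names.map(lambda value: closest_name(value, known_names))
print(corrected.tolist())
```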


In [16]:
# remove duplicates
df = df.drop_duplicates()
df.head()

Unnamed: 0,date,country,name,sales_2020,sales_2021
0,2021-12-01,United States,Aammar,100.0,
1,NaT,United States,Aammar,200.0,150.0
2,NaT,United States,Hamza,,300.0
3,NaT,United States,Hamza,200.0,150.0


In [17]:
# remove rows that repeat a value in a specific column (keeps the first occurrence)
df = df.drop_duplicates(subset="name")
df.head()

Unnamed: 0,date,country,name,sales_2020,sales_2021
0,2021-12-01,United States,Aammar,100.0,
2,NaT,United States,Hamza,,300.0
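
The first `drop_duplicates()` call removed nothing because no two rows matched in every column, while the `subset="name"` call compared names only. Before dropping anything, `duplicated()` can show which rows each rule would affect. The `records` DataFrame below is an assumed variation of the example data with one fully repeated row added.

```python
import pandas as pd

records = pd.DataFrame({
    "name": ["Aammar", "Aammar", "Hamza", "Hamza", "Hamza"],
    "sales_2020": [100, 200, None, 200, 200],
    "sales_2021": [None, 150, 300, 150, 150],
})

# True only for rows that repeat an earlier row in every column
full_row_dupes = records.duplicated()

# True for every row whose name appears more than once
name_dupes = records.duplicated(subset="name", keep=False)

print(records[full_row_dupes])
print(records[name_dupes])
```

Dropping by `name` alone discards rows whose sales figures differ, so it is worth checking the flagged rows first.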


In [20]:
# 5. Resolving Contradictory Data
# For demonstration, let's assume sales_2021 should always be higher than sales_2020
# We'll remove rows where this condition is not met
df = df.drop(df[df['sales_2021'] <= df['sales_2020']].index)
df.head()

Unnamed: 0,date,country,name,sales_2020,sales_2021
0,2021-12-01,USA,Aammar,100.0,
2,2022/12/01,America,Hamza,,300.0
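
Rows 0 and 2 survive the filter because a comparison involving a missing value evaluates to `False`, not because they satisfy the rule. A sketch that separates these cases by labeling rows instead of dropping them is shown below. The `status` column and its labels are assumptions introduced for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    "sales_2020": [100, 200, None, 200],
    "sales_2021": [None, 150, 300, 150],
})

# Rule from the demonstration: sales_2021 should be higher than sales_2020
violates_rule = sales["sales_2021"] <= sales["sales_2020"]

# Rows with a missing value cannot be checked against the rule at all
cannot_check = sales[["sales_2020", "sales_2021"]].isna().any(axis=1)

sales["status"] = "ok"
sales.loc[violates_rule, "status"] = "contradictory"
sales.loc[cannot_check, "status"] = "unverifiable"
print(sales)
```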
