Here are some examples of best practices to look out for when cleaning data.
I'll use some made-up DataFrames. If you want to see how this is handled on real data, I also have a repository that shows how to use the Reddit API and then clean and analyze the data.

In [1]:
# Examples of how to clean data
import pandas as pd
import numpy as np

How to Handle Missing Values

In [2]:
# Sample DataFrame
data = {
    'Product': ['Solar Panel', 'Inverter', 'Battery', 'Solar Panel', np.nan],
    'Price': [250, 500, np.nan, 300, 450],
    'Quantity': [10, 5, 8, np.nan, 7]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Handling Missing Values

# 1. Removing rows with any missing values
df_cleaned = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
print(df_cleaned)

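# 2. Imputing missing values
df_imputed = df.copy()
# Assign the result back to the column. Calling fillna(inplace=True) on a column
# selection triggers a warning in recent pandas versions and may not update the DataFrame.
df_imputed['Product'] = df_imputed['Product'].fillna('Unknown')
df_imputed['Price'] = df_imputed['Price'].fillna(df_imputed['Price'].mean())
df_imputed['Quantity'] = df_imputed['Quantity'].fillna(df_imputed['Quantity'].median())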
print("\nDataFrame after imputing missing values:")
print(df_imputed)

# There are different ways to fill in missing values: the mean, median, mode, or a constant value.
# You can also use more advanced techniques, such as interpolation or machine learning models.
# Consider the nature of the data and how each imputation method affects the analysis.
# For example, the mean is sensitive to extreme values, while the median is not.
# In some cases, it may be appropriate to drop rows or columns with missing values if they are not critical to the analysis.
# Always evaluate the implications of how you handle missing values in your data.
# Also let the reader (whether of the code or others on your project team) know how you handled missing values in your analysis.

Original DataFrame:
       Product  Price  Quantity
0  Solar Panel  250.0      10.0
1     Inverter  500.0       5.0
2      Battery    NaN       8.0
3  Solar Panel  300.0       NaN
4          NaN  450.0       7.0

DataFrame after dropping rows with missing values:
       Product  Price  Quantity
0  Solar Panel  250.0      10.0
1     Inverter  500.0       5.0

DataFrame after imputing missing values:
       Product  Price  Quantity
0  Solar Panel  250.0      10.0
1     Inverter  500.0       5.0
2      Battery  375.0       8.0
3  Solar Panel  300.0       7.5
4      Unknown  450.0       7.0
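
The comments above mention interpolation and the mode as alternatives to the mean and median. Here is a minimal sketch of both. The `readings` DataFrame and its column names are made up for illustration, and linear interpolation only makes sense when the rows have a meaningful order (such as days).

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings, invented for this example
readings = pd.DataFrame({
    'Day': [1, 2, 3, 4, 5],
    'Output_kWh': [20.0, np.nan, 24.0, np.nan, 30.0],
    'Panel_Type': ['Mono', 'Poly', np.nan, 'Mono', 'Mono'],
})

# Linear interpolation fills each numeric gap from its neighbouring values
readings['Output_kWh'] = readings['Output_kWh'].interpolate(method='linear')

# Mode imputation fills a categorical gap with the most frequent value
most_common = readings['Panel_Type'].mode()[0]
readings['Panel_Type'] = readings['Panel_Type'].fillna(most_common)

print(readings)
```

Whichever method you pick, note it down next to the result so others can see which values were observed and which were filled in.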


How to Handle Duplicates

In [3]:
# How to Handle Duplicates
# Sample DataFrame with duplicates
data = {
    'OrderID': [101, 102, 103, 101, 104],
    'Product': ['Solar Panel', 'Inverter', 'Battery', 'Solar Panel', 'Battery'],
    'Price': [250, 500, 300, 250, 300],
    'Quantity': [10, 5, 8, 10, 8]
}
df_duplicates = pd.DataFrame(data)

print("DataFrame with Duplicates:")
print(df_duplicates)

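# Removing duplicate rows
# With no arguments, drop_duplicates() removes rows that match in every column and keeps the first occurrence
df_no_duplicates = df_duplicates.drop_duplicates()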
print("\nDataFrame after removing duplicates:")
print(df_no_duplicates)

# In some cases, duplicate rows are valid and should not be removed.
# Consider the context of the data and how removing duplicates affects the analysis.
# As with missing values, communicate how duplicates were handled so the analysis stays transparent and reproducible.


DataFrame with Duplicates:
   OrderID      Product  Price  Quantity
0      101  Solar Panel    250        10
1      102     Inverter    500         5
2      103      Battery    300         8
3      101  Solar Panel    250        10
4      104      Battery    300         8

DataFrame after removing duplicates:
   OrderID      Product  Price  Quantity
0      101  Solar Panel    250        10
1      102     Inverter    500         5
2      103      Battery    300         8
4      104      Battery    300         8
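
Since some duplicates are valid, it can help to look at them before dropping anything. The sketch below reuses the same sample data under the hypothetical name `orders`. It shows `duplicated(keep=False)` for inspection and the `subset` parameter of `drop_duplicates` for choosing which columns define a duplicate. Treating `OrderID` as the key is an assumption made for this example.

```python
import pandas as pd

orders = pd.DataFrame({
    'OrderID': [101, 102, 103, 101, 104],
    'Product': ['Solar Panel', 'Inverter', 'Battery', 'Solar Panel', 'Battery'],
    'Price': [250, 500, 300, 250, 300],
    'Quantity': [10, 5, 8, 10, 8],
})

# keep=False marks every copy of a duplicated row, so all copies can be reviewed
exact_copies = orders[orders.duplicated(keep=False)]
print(exact_copies)

# Ignoring the key column treats orders 103 and 104 as duplicates of each other
by_product = orders.drop_duplicates(subset=['Product', 'Price', 'Quantity'])

# Deduplicating on the key column keeps one row per order
by_key = orders.drop_duplicates(subset=['OrderID'])

print(len(by_product), len(by_key))
```

Orders 103 and 104 have the same product, price, and quantity but are separate orders, which is the kind of valid duplicate you would not want to lose.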


How to Handle Different or Incorrect Data Types

In [4]:
# How to Handle Incorrect Data Types
# Sample DataFrame with incorrect data types
data = {
    'OrderID': ['101', '102', '103', '104'],
    'Product': ['Solar Panel', 'Inverter', 'Battery', 'Battery'],
    'Price': ['250', '500', '300', '450'],
    'Quantity': ['10', '5', '8', '7']
}
df_types = pd.DataFrame(data)
print("DataFrame with Incorrect Data Types:")
print(df_types.dtypes)

# Correcting data types
df_types['OrderID'] = df_types['OrderID'].astype(int)
df_types['Price'] = df_types['Price'].astype(float)
df_types['Quantity'] = df_types['Quantity'].astype(int)

print("\nDataFrame after Correcting Data Types:")
print(df_types.dtypes)

DataFrame with Incorrect Data Types:
OrderID     object
Product     object
Price       object
Quantity    object
dtype: object

DataFrame after Correcting Data Types:
OrderID       int64
Product      object
Price       float64
Quantity      int64
dtype: object
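
`astype` works here because every string is a clean number. If a column contains something that cannot be parsed, `astype` raises a `ValueError`. One option is `pd.to_numeric` with `errors='coerce'`, sketched below. The `raw` DataFrame and its messy values are made up for illustration.

```python
import pandas as pd

# Hypothetical raw values, invented for this example
raw = pd.DataFrame({
    'Price': ['250', '500', 'N/A', '450'],
    'Quantity': ['10', 'five', '8', '7'],
})

# errors='coerce' turns strings that cannot be parsed into NaN instead of raising
raw['Price'] = pd.to_numeric(raw['Price'], errors='coerce')
raw['Quantity'] = pd.to_numeric(raw['Quantity'], errors='coerce')

print(raw.dtypes)
print(raw.isna().sum())
```

Coercing hides the bad values as NaN, so check the missing-value counts afterwards and handle them like any other missing data.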


How to Handle Outliers/Inconsistent Data

In [6]:
# How to Handle Outliers/Inconsistent Data
# Sample DataFrame with outliers
data = {
    'Sales': [100, 150, 200, 250, 300, 1000, 350, 400]
}
df_outliers = pd.DataFrame(data)

print("DataFrame with Outliers:")
print(df_outliers)

# Detecting outliers using Z-score
from scipy import stats

df_outliers['Z_Score'] = stats.zscore(df_outliers['Sales'])
print("\nDataFrame with Z-Scores:")
print(df_outliers)

# Removing rows whose absolute Z-score is 2 or more (the threshold is a choice; 3 is also common)
df_no_outliers = df_outliers[abs(df_outliers['Z_Score']) < 2]
print("\nDataFrame after Removing Outliers:")
# tail() prints only the last five rows; rows 0 and 1 are still in df_no_outliers
print(df_no_outliers.tail())

# In this case, we removed the sale of 1000 because its Z-score (about 2.48) exceeds the threshold of 2.
# Consider the context of the data and how removing outliers affects the analysis.
# Outliers may contain valuable information or indicate errors in the data collection process.
# As with missing values and duplicates, document and communicate the criteria used to detect and remove outliers.


DataFrame with Outliers:
   Sales
0    100
1    150
2    200
3    250
4    300
5   1000
6    350
7    400

DataFrame with Z-Scores:
   Sales   Z_Score
0    100 -0.919494
1    150 -0.730880
2    200 -0.542266
3    250 -0.353652
4    300 -0.165037
5   1000  2.475561
6    350  0.023577
7    400  0.212191

DataFrame after Removing Outliers:
   Sales   Z_Score
2    200 -0.542266
3    250 -0.353652
4    300 -0.165037
6    350  0.023577
7    400  0.212191
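
The Z-score is one way to detect outliers, but the mean and standard deviation it relies on are themselves pulled by extreme values. A common alternative is the interquartile range (IQR) rule, sketched below on the same sales figures. This is a different technique from the one used above, and the 1.5 multiplier is a convention rather than a fixed rule. The `sales` name and the `Is_Outlier` column are made up for illustration.

```python
import pandas as pd

sales = pd.DataFrame({'Sales': [100, 150, 200, 250, 300, 1000, 350, 400]})

# Quartiles and interquartile range
q1 = sales['Sales'].quantile(0.25)
q3 = sales['Sales'].quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag rather than delete, so the decision stays visible and reversible
sales['Is_Outlier'] = ~sales['Sales'].between(lower, upper)
print(sales[sales['Is_Outlier']])
```

Flagging keeps the original rows available, which makes it easier to show others exactly which values were excluded and why.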
