This notebook demonstrates an end-to-end data cleaning workflow using the pandas library. The goal is to transform a raw dataset containing inconsistencies and missing values into a structured format ready for analysis.


Key Actions Taken:

Structural Auditing: Identified the dataset's shape and verified column data types.

Data Standardization:

* Renamed the County column to Country for clarity.

* Standardized categorical entries for countries (e.g., correcting "Eng" and "English" to "England").

* Corrected spelling errors in the Name column (e.g., "Charlote" to "Charlotte").


Handling Missing Values (Imputation): Calculated the mean age and mean height to fill missing records for participants like Luna, Liam, and Mateo. Manually corrected missing Gender entries to ensure data completeness.


Feature Engineering & Removal: Dropped the Weight column as it was no longer required for the specific scope of this analysis.


Visual Analysis: Generated a multi-colored bar chart using Matplotlib to visualize the distribution of participants across England, Scotland, and Wales.

In [None]:
#import the pandas library for loading and cleaning the data
import pandas as pd

In [None]:
#loading file directly from my local project folder
sales_data = pd.read_csv("F7 Python Chalenge Data.csv")
sales_data

In [None]:
#find the shape of the data (how many rows and columns are there)
sales_data.shape

In [None]:
#find the data types in the file
print(sales_data.dtypes)


In [None]:
#rename the 'County' column to 'Country'

sales_data.rename(columns = {"County" : "Country"}, inplace = True)
sales_data

In [None]:
#correct the spelling of 'Charlote' to 'Charlotte'
sales_data["Name"] = sales_data["Name"].replace(["Charlote"], "Charlotte")

In [None]:
#check data
sales_data

In [None]:
#find the average (mean) age
sales_data["Age"].mean()

In [None]:
sales_data

In [None]:
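#replace the missing ages with the average age

#Luna's height is also missing: keep it as NaN here, because a placeholder of 0
#would be counted in the mean height calculated in the next step and drag it down
sales_data.loc[8] = {"Name" : "Luna",
                     "Age" : 37.25,
                     "Height" : float("nan"),
                     "Weight" : 80.0,
                     "Gender" : "F",
                     "Country" : "Scotland"}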

sales_data.loc[9] = {"Name" : "Liam",
                     "Age" : 37.25,
                     "Height" : 175.0,
                     "Weight" : 80.0,
                     "Gender" : "M",
                     "Country" : "Scotland"}

sales_data.loc[16] = {"Name" : "Mateo",
                      "Age" : 37.25,
                      "Height" : 178.0,
                      "Weight" : 70.0,
                      "Gender" : "M",
                      "Country" : "England"}
sales_data

In [None]:
#find the average (mean) height (no decimal places)
mean_height = sales_data["Height"].mean()
print(format(mean_height, ".0f"))

In [None]:
#replace the missing height (Luna, row 8) with the rounded average calculated above
sales_data.loc[8, "Height"] = round(mean_height)
sales_data
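
The cells above fill the gaps one row at a time. As a rough alternative, the same mean imputation can be sketched with `fillna`, which fills every missing entry in a column in one step. The `toy` DataFrame and its values below are made up for illustration and are not the challenge data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV; None marks a missing value.
toy = pd.DataFrame({
    "Name": ["Ann", "Ben", "Cara", "Dev"],
    "Age": [30.0, None, 40.0, None],
    "Height": [160.0, 180.0, None, 170.0],
})

# mean() skips missing values by default, so no placeholder (such as 0) is needed.
mean_age = toy["Age"].mean()
mean_height = round(toy["Height"].mean())  # no decimal places

# Fill every missing entry in each column with that column's mean.
toy["Age"] = toy["Age"].fillna(mean_age)
toy["Height"] = toy["Height"].fillna(mean_height)

print(toy)
```

Because the means are computed from the data rather than typed in, the filled values stay correct if the input file changes.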

In [None]:
#make sure the countries are 'England', 'Scotland' and 'Wales'
sales_data.loc[4] = {"Name" : "Amelia",
                     "Age" : 25.00,
                     "Height" : 170.0,
                     "Weight" : 55.0,
                     "Gender" : "F",
                     "Country" : "England"}

sales_data.loc[6] = {"Name" : "Ava",
                     "Age" : 58,
                     "Height" : 172,
                     "Weight" : 50.0,
                     "Gender" : "F",
                     "Country" : "England"}

sales_data.loc[10] = {"Name" : "Oliver",
                      "Age" : 35,
                      "Height" : 180.0,
                      "Weight" : 84.0,
                      "Gender" : "M",
                      "Country" : "England"}

sales_data.loc[12] = {"Name" : "Theodore",
                      "Age" : 18,
                      "Height" : 172.0,
                      "Weight" : 90.0,
                      "Gender" : "M",
                      "Country" : "Wales"}

sales_data.loc[13] = {"Name" : "Lucas",
                      "Age" : 25,
                      "Height" : 180.0,
                      "Weight" : 67.0,
                      "Gender" : "M",
                      "Country" : "Scotland"}
sales_data
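
Rewriting whole rows works, but it also retypes values that were already correct. A possible alternative is to map the known misspellings onto the correct label with `Series.replace`. The sketch below uses a made-up `toy` DataFrame; only "Eng" and "English" come from the summary at the top of this notebook, and any other variants in the real file would need to be added to the mapping.

```python
import pandas as pd

# Hypothetical toy data; only the Country column matters for this example.
toy = pd.DataFrame({
    "Name": ["Ann", "Ben", "Cara", "Dev"],
    "Country": ["Eng", "English", "Scotland", "Wales"],
})

# Map each known variant to the label it should become.
country_fixes = {"Eng": "England", "English": "England"}
toy["Country"] = toy["Country"].replace(country_fixes)

# Check that nothing outside the three expected labels is left.
allowed = {"England", "Scotland", "Wales"}
unexpected = set(toy["Country"]) - allowed

print(toy["Country"].tolist())
print("Unexpected labels:", unexpected)
```

An empty `unexpected` set indicates that every entry is now one of the three expected countries.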

In [None]:
#delete (drop) the weight column
sales_data.drop(columns = "Weight", inplace = True)
sales_data

In [None]:
#set James's gender to 'M'
sales_data.loc[15] = {"Name" : "James",
                      "Age" : 60,
                      "Height" : 188.0,
                      "Gender" : "M",
                      "Country" : "Scotland"}
sales_data

In [None]:
#delete the row containing William's details using .drop(index = n, inplace = True)
sales_data.drop(index = 18, inplace = True)
sales_data
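
Dropping by a typed-in index label only works while William stays at row 18. A sketch of a more robust variant, assuming the row can be identified by its Name value, looks up the matching index labels first. The `toy` DataFrame is illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the real table.
toy = pd.DataFrame({
    "Name": ["Ann", "William", "Cara"],
    "Age": [30, 45, 40],
})

# Find the index labels of every row whose Name is "William", then drop them.
rows_to_drop = toy.index[toy["Name"] == "William"]
toy = toy.drop(index=rows_to_drop)

print(toy)
```

The remaining rows keep their original index labels; `reset_index(drop=True)` could be chained on if a fresh 0-based index is wanted.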


In [None]:
#Find the total age
sales_data["Age"].sum()

In [None]:
#find the maximum height
sales_data["Height"].max()

In [None]:
#find the minimum height
sales_data["Height"].min()

In [None]:
#check there are no null values left in the table
#(a notebook cell only displays its last expression, so one call covering every column is used)
sales_data.isnull().sum()

In [None]:
#create a vertical bar chart - plotting the country against the quantity, add a suitable title and axis names

import numpy as np #numerical arrays, used here to hold the chart data
import matplotlib.pyplot as plt #pyplot is the plotting interface of the much larger Matplotlib library

In [None]:
#bar chart (usually refers to categorical data)
x_axis = np.array(["England", "Scotland", "Wales"])
y_axis = np.array([11, 5, 2])
plt.bar(x_axis, y_axis, color=["red", "blue", "green"])
plt.xlabel("Country")
plt.ylabel("Quantity")
plt.title("Country Table")
plt.show()
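
The counts in the chart above are typed in by hand. One way to derive them from the data instead is `value_counts`, sketched here on a made-up `toy` DataFrame rather than the real table.

```python
import pandas as pd

# Hypothetical toy data; the real notebook would use sales_data["Country"].
toy = pd.DataFrame({
    "Country": ["England", "Scotland", "England", "Wales", "England"],
})

# Count the rows per country, then fix the order of the bars.
# fill_value=0 keeps a country on the chart even if it has no rows.
counts = toy["Country"].value_counts().reindex(
    ["England", "Scotland", "Wales"], fill_value=0
)

print(counts)
```

`counts.index` and `counts.values` could then be passed to `plt.bar` in place of the hand-written `x_axis` and `y_axis` arrays.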