# __Project 1__

## Calculation of Mean, Median and Mode with Pandas

__Step 1: Importing pandas__

In [56]:
import pandas as pd

__Step 2: Load the CSV file__

In [57]:
df = pd.read_csv('EPI Data Library - Unemployment.csv')
df.head()

Unnamed: 0,Date,All
0,Aug-2024,3.9%
1,Jul-2024,3.9%
2,Jun-2024,3.8%
3,May-2024,3.8%
4,Apr-2024,3.7%


__Step 3: Clean the Data__

Cleaning the data takes three steps. First, the 'All' column is renamed to 'Percentage Unemployed' so that its name describes its contents. Second, the 'Date' column is split into 'Month' and 'Year'. Third, the % sign is stripped from the 'Percentage Unemployed' column and the remaining values are converted to numeric.

In [58]:
# Only 'All' needs a new name; 'Date' is already descriptive
df = df.rename(columns={'All': 'Percentage Unemployed'})
df.head()

Unnamed: 0,Date,Percentage Unemployed
0,Aug-2024,3.9%
1,Jul-2024,3.9%
2,Jun-2024,3.8%
3,May-2024,3.8%
4,Apr-2024,3.7%


In [59]:
df[['Month', 'Year']] = df['Date'].str.split('-', expand=True)
df.head()

Unnamed: 0,Date,Percentage Unemployed,Month,Year
0,Aug-2024,3.9%,Aug,2024
1,Jul-2024,3.9%,Jul,2024
2,Jun-2024,3.8%,Jun,2024
3,May-2024,3.8%,May,2024
4,Apr-2024,3.7%,Apr,2024


In [60]:
df['Percentage Unemployed'] = df['Percentage Unemployed'].str.rstrip('%')

In [61]:
df['Percentage Unemployed'] = pd.to_numeric(df['Percentage Unemployed'])
df.head()

Unnamed: 0,Date,Percentage Unemployed,Month,Year
0,Aug-2024,3.9,Aug,2024
1,Jul-2024,3.9,Jul,2024
2,Jun-2024,3.8,Jun,2024
3,May-2024,3.8,May,2024
4,Apr-2024,3.7,Apr,2024
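The same cleaning steps can be illustrated on a single raw record without pandas. This is only a minimal sketch: `clean_record` is a hypothetical helper, and it assumes every date has the `Mon-YYYY` form and every rate ends in `%`, as in the rows shown above.

```python
def clean_record(date_text, rate_text):
    """Split 'Aug-2024' into month and year, and turn '3.9%' into 3.9."""
    # Same idea as str.split('-', expand=True) on the 'Date' column
    month, year = date_text.split('-')
    # Same idea as str.rstrip('%') followed by pd.to_numeric
    rate = float(rate_text.rstrip('%'))
    return {'Month': month, 'Year': year, 'Percentage Unemployed': rate}


print(clean_record('Aug-2024', '3.9%'))
# {'Month': 'Aug', 'Year': '2024', 'Percentage Unemployed': 3.9}
```

Note that 'Year' stays a string here, just as it does in the DataFrame after the split.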


__Step 4: Calculation of Mean, Median and Mode__ 

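The built-in pandas methods `mean()`, `median()` and `mode()` are used to calculate these statistics. Note that `mode()` returns a Series rather than a single number, because a dataset can have more than one mode.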

In [62]:
Mean = df["Percentage Unemployed"].mean()
print(Mean)

6.102185792349727


In [63]:
Mode = df["Percentage Unemployed"].mode()
print(Mode)

0    5.3
Name: Percentage Unemployed, dtype: float64


In [64]:
Median = df["Percentage Unemployed"].median()
print(Median)

5.8


## __Calculation of Mean without Pandas__

To calculate the mean, the 'Percentage Unemployed' column is first converted to a list of floats. The values are then summed, and the mean is obtained by dividing that sum by the number of values in the list.

In [75]:
# Convert the column to a plain Python list of floats
percentage_unemployed = [float(item) for item in df['Percentage Unemployed'].tolist()]

In [103]:
unemployed_sum = sum(percentage_unemployed)

In [94]:
unemployed_no = len(percentage_unemployed)

In [102]:
mean_unemployment = unemployed_sum / unemployed_no
print('Mean is: ' + str(mean_unemployment))

Mean is: 6.102185792349727
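The sum-and-divide steps above can also be wrapped in a small reusable function. The following is a sketch rather than the notebook's own method: `mean_of` is a hypothetical name, `math.fsum` is swapped in for `sum` to limit floating-point rounding error (so the last digits may differ slightly from a plain `sum`), and the sample values are the five rows shown by `df.head()`.

```python
import math


def mean_of(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError('mean_of() needs at least one value')
    # fsum tracks partial sums exactly, then rounds once at the end
    return math.fsum(values) / len(values)


sample = [3.9, 3.9, 3.8, 3.8, 3.7]  # first five rows of the dataset
print(round(mean_of(sample), 2))  # 3.82
```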


## __Calculation of Median Without Pandas__


The median is calculated in the following steps:
First, the number of elements (n) is counted and the list of values is sorted. If n is even, the median is the average of the two middle values. If n is odd, the median is the middle value itself. Finally, the computed median is printed.

In [100]:
n = len(percentage_unemployed) 
percentage_unemployed.sort() 
 
if n % 2 == 0: 
    median1 = percentage_unemployed[n//2] 
    median2 = percentage_unemployed[n//2 - 1] 
    median = (median1 + median2)/2
else: 
    median = percentage_unemployed[n//2] 
print('Median is: ' + str(median)) 

Median is: 5.8
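One side effect of the cell above is that `sort()` reorders `percentage_unemployed` in place. A minimal sketch of the same even/odd logic that leaves the input untouched is shown here; `median_of` is a hypothetical name, and the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics


def median_of(values):
    """Return the median without modifying the input list."""
    ordered = sorted(values)  # sorted() returns a new list
    n = len(ordered)
    if n == 0:
        raise ValueError('median_of() needs at least one value')
    mid = n // 2
    if n % 2 == 0:
        # Even length: average the two middle values
        return (ordered[mid - 1] + ordered[mid]) / 2
    # Odd length: the middle value is the median
    return ordered[mid]


sample = [3.9, 3.9, 3.8, 3.8, 3.7]  # first five rows of the dataset
print(median_of(sample))  # 3.8
print(median_of(sample) == statistics.median(sample))  # True
```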


## __Calculation of Mode Without Pandas__

To calculate the mode, the values in the dataset are first counted using a dictionary, where each unique value is mapped to its frequency of occurrence. Once the frequencies are determined, the maximum frequency is identified, and all values corresponding to this frequency are extracted as the mode. The final mode is printed.

In [113]:
# Step 1: Count the occurrences of each value using a dictionary
frequency_dict = {}
for value in percentage_unemployed:
    if value in frequency_dict:
        frequency_dict[value] += 1
    else:
        frequency_dict[value] = 1

# Step 2: Find the maximum frequency
max_frequency = max(frequency_dict.values())

# Step 3: Find all values with the maximum frequency
modes = [key for key, freq in frequency_dict.items() if freq == max_frequency]

# Print results
print('Mode is: ' + str(modes))

Mode is: [5.3]
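The dictionary-counting approach above can be shortened with `collections.Counter` from the standard library. This is a sketch under the same logic, with `modes_of` as a hypothetical name. The sample (the five rows shown by `df.head()`) has two values tied for the highest frequency, which illustrates why the result is a list rather than a single number.

```python
from collections import Counter


def modes_of(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)  # maps each unique value to its frequency
    if not counts:
        return []
    top = max(counts.values())
    return [value for value, freq in counts.items() if freq == top]


sample = [3.9, 3.9, 3.8, 3.8, 3.7]  # first five rows of the dataset
print(modes_of(sample))  # [3.9, 3.8]
```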


## __Data Visualisation__


_The first step in the data visualisation process is to group the data by year and take the mean unemployment rate for each year_

In [126]:
grouped_df = df.groupby('Year')['Percentage Unemployed'].mean()
grouped_df.head()

Year
1978    6.100000
1979    5.908333
1980    6.533333
1981    7.550000
1982    8.691667
Name: Percentage Unemployed, dtype: float64
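To make the group-and-average step concrete, here is a rough pure-Python sketch of what `groupby('Year')[...].mean()` computes. `mean_by_year` is a hypothetical helper, and the records are illustrative values, not taken from the dataset.

```python
from collections import defaultdict


def mean_by_year(records):
    """Average the rate for each year in a list of (year, rate) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, rate in records:
        totals[year] += rate
        counts[year] += 1
    # Return the years in sorted order, as groupby does by default
    return {year: totals[year] / counts[year] for year in sorted(totals)}


# Illustrative values only, not from the unemployment dataset
records = [('2002', 5.0), ('2001', 4.0), ('2001', 6.0)]
print(mean_by_year(records))  # {'2001': 5.0, '2002': 5.0}
```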

_The next step is to create a function for the sparkline visualisation. Inside the function, the data is converted to a dictionary for easier iteration, and the minimum, maximum, and range of the values are calculated. For each data point (here, each year), the function determines how many stars represent the value proportionally. If all data points have the same value, a default of 5 stars is assigned; otherwise, the value is normalized relative to the range of the dataset, scaled to a range of 0–10, and truncated to a whole number of stars with int(). Because of this truncation, the year with the minimum value, and any year within the lowest tenth of the range, is shown with no stars. Each year and its sparkline are printed._

In [125]:
# Use the grouped Series (mean unemployment per year) as input
data = grouped_df

# Function to create sparkline
def create_sparkline(data):
    # Convert the Pandas Series into a dictionary
    data_dict = data.to_dict()

    # Find the minimum and maximum values
    min_value = min(data_dict.values())
    max_value = max(data_dict.values())
    range_value = max_value - min_value

    # Generate stars proportional to the values
    for year, value in data_dict.items():
        if range_value == 0:  # Handle case where all values are the same
            stars = '*' * 5
        else:
            stars = '*' * int(((value - min_value) / range_value) * 10)
        print(f"{year}: {stars}")

# Call the function
create_sparkline(grouped_df)

1978: ***
1979: ***
1980: ****
1981: ******
1982: *******
1983: **********
1984: *******
1985: *****
1986: *****
1987: ****
1988: ***
1989: **
1990: **
1991: ****
1992: *****
1993: *****
1994: ****
1995: ***
1996: ***
1997: **
1998: *
1999: *
2000: 
2001: *
2002: **
2003: ***
2004: ***
2005: **
2006: *
2007: *
2008: **
2009: ******
2010: *********
2011: ********
2012: *******
2013: ******
2014: ****
2015: ***
2016: **
2017: *
2018: 
2019: 
2020: ***
2021: *****
2022: 
2023: 
2024: 
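The scaling used by `create_sparkline` can be isolated into a small function to show why some years print no stars at all. This is a sketch that mirrors the formula above: `star_count` and `sparkline_rows` are hypothetical names, the rows are returned rather than printed, and the toy values are illustrative, not from the dataset.

```python
def star_count(value, min_value, max_value, width=10, default=5):
    """Map a value to a whole number of stars between 0 and width."""
    span = max_value - min_value
    if span == 0:  # all values identical, so no meaningful scale
        return default
    # int() truncates, so the minimum value always maps to 0 stars
    return int((value - min_value) / span * width)


def sparkline_rows(series):
    """Build one 'label: stars' string per entry of a dict."""
    lo, hi = min(series.values()), max(series.values())
    return [f"{key}: {'*' * star_count(val, lo, hi)}" for key, val in series.items()]


# Illustrative values only, not from the unemployment dataset
toy = {'A': 2.0, 'B': 4.0, 'C': 6.0}
for row in sparkline_rows(toy):
    print(row)
# A: 
# B: *****
# C: **********
```

Returning the rows instead of printing them makes the function easier to test and reuse. Rounding instead of truncating would be an alternative design, but it would still give the minimum value zero stars.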
