# Introduction to the Practical

In this practical, we will demonstrate how to investigate, clean, and prepare a dataset, along with analysing the complexity of algorithms. This ensures the data is accurate, consistent, and ready for any further analysis or modelling.

We are limited in the libraries that can be used for the data analysis, so the following libraries will be used:

1. numpy (np) - This is for numerical operations and handling arrays etc.
2. time - This is to track the execution time of algorithms and assess their time complexity.
3. os - This is to interact with the operating system, especially to check memory usage or file sizes.
4. garbage collection (gc) - This is to manage the memory usage and clean objects that are no longer in use.
5. pandas (pd) - This is for data manipulation and analysis.
6. psutil - This is to monitor system resources like CPU and memory being used.
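
As a rough illustration of how `time` and `gc` from the list above can be combined to assess running time, the sketch below times a NumPy sort at increasing input sizes. The helper name `time_call`, the input sizes, and the choice of `np.sort` are assumptions made for this example only; they are not part of the dataset work that follows.

```python
import gc
import time

import numpy as np


def time_call(func, *args, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` runs of func(*args)."""
    gc.collect()  # collect leftover garbage first so it is less likely to skew the timing
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Time the same operation at growing input sizes to see how it scales.
rng = np.random.default_rng(0)
for n in (10_000, 100_000, 1_000_000):
    values = rng.random(n)
    print(f"n={n:>9,}  sort took {time_call(np.sort, values):.6f} s")
```

Taking the best of several runs reduces noise from other processes; comparing how the time grows as `n` grows gives an empirical feel for the time complexity.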

We will also use Git for version control, with code committed regularly and maintained in the repository. All Git logs will be attached with the submission.



## Stage 1 - Importing required Libraries and loading the data set



Import the libraries:

In [7]:
import pandas as pd
import numpy as np
import time
import gc
import psutil
import os


Load the Dataset

In [15]:
# Defining the dataset file path
dataset_filepath = 'dataset.csv'

# Load the dataset as a pandas DataFrame
data = pd.read_csv(dataset_filepath, header=None)

# Displaying the first 5 rows to see if the data is being shown successfully
data.head()


Unnamed: 0,0
0,Level\tT4\tT3\tT3adjusted\tT4adjusted
1,5\t8.1\t2.1\t2.00829885\t1.280579165
2,5\t8.7\t\t2.056710116\t
3,10\t3.5\t1.6\t1.518294486\t1.169607095
4,20\t7.9\t4.6\t1.991631701\t1.663103499
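
The output above shows each row arriving as a single tab-delimited string. As an alternative sketch (not the approach this notebook takes), `pd.read_csv` can be told the delimiter directly with `sep='\t'`. The `sample` string below simply copies the first rows displayed above, and the name `preview` is hypothetical.

```python
import io

import pandas as pd

# First rows of the file as displayed above; real code would pass the file path instead.
sample = (
    "Level\tT4\tT3\tT3adjusted\tT4adjusted\n"
    "5\t8.1\t2.1\t2.00829885\t1.280579165\n"
    "5\t8.7\t\t2.056710116\t\n"
    "10\t3.5\t1.6\t1.518294486\t1.169607095\n"
)

# sep='\t' tells pandas the fields are tab-delimited, so the header row and
# numeric types are detected on load and empty fields become NaN.
preview = pd.read_csv(io.StringIO(sample), sep="\t")
print(preview.dtypes)
print(preview)
```

Reading the file this way would give named, numeric columns in one step, at the cost of skipping the manual splitting exercise.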


### Stage 2 - Data Cleaning
In this section we are going to focus on cleaning the data set.

In [14]:
# Cleaning up the file
# Split each row on the tab character '\t', creating a separate column wherever a tab is encountered.
# The file was first read as a single column, which is now split into multiple columns
data_cleaned = data[0].str.split('\t', expand=True)

# Assign column names from the first row and remove the header row from the data
data_cleaned.columns = data_cleaned.iloc[0] # Assign first row of the data as the actual column headers
data_cleaned = data_cleaned.drop(0).reset_index(drop=True)

# Convert the columns to numeric. After the split the data is in string format, so it must be made numeric before statistical operations can be performed
data_cleaned = data_cleaned.apply(pd.to_numeric)

# Display the cleaned dataframe to inspect if the data is being shown as expected
data_cleaned.head()

Unnamed: 0,Level,T4,T3,T3adjusted,T4adjusted
0,5,8.1,2.1,2.008299,1.280579
1,5,8.7,,2.05671,
2,10,3.5,1.6,1.518294,1.169607
3,20,7.9,4.6,1.991632,1.663103
4,30,2.3,0.4,1.320006,0.736806
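
After a conversion like this, it can help to confirm that every column is numeric and to count how many values are missing. The sketch below does this on a small hypothetical frame (`raw`) that mimics the state right after the tab split. Passing `errors='coerce'` is an assumption of this example rather than what the cell above does.

```python
import pandas as pd

# Toy frame mimicking the state right after the tab split: every cell is a string.
raw = pd.DataFrame({
    "Level": ["5", "5", "10"],
    "T4": ["8.1", "8.7", "3.5"],
    "T3": ["2.1", "", "1.6"],
})

# errors='coerce' turns anything that cannot be parsed (here the empty string) into NaN.
converted = raw.apply(pd.to_numeric, errors="coerce")

print(converted.dtypes)        # every column should now be numeric
print(converted.isna().sum())  # number of missing values per column
```

The per-column missing counts show where imputation will be needed in the next stage.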


### Stage 3 - Handling Missing Values
In this section we will identify and replace any missing values in columns T3 and T4 using the average value for their Level.

In [30]:
# Replacing Missing Values in T3 and T4 with Mean Values Grouped by 'Level'

# Drop rows where 'Level' is missing; .copy() avoids a SettingWithCopyWarning on the assignment below
data_cleaned = data_cleaned[data_cleaned['Level'].notnull()].copy()
# Coerce any non-numeric 'Level' entries to NaN
data_cleaned['Level'] = pd.to_numeric(data_cleaned['Level'], errors='coerce')

# Remove rows where 'Level' is still NaN (if any) after conversion
data_cleaned = data_cleaned.dropna(subset=['Level'])


# Fill missing values in T3 and T4 with group means based on the 'Level' column
data_cleaned['T3'] = data_cleaned.groupby('Level')['T3'].transform(lambda x: x.fillna(x.mean()))
data_cleaned['T4'] = data_cleaned.groupby('Level')['T4'].transform(lambda x: x.fillna(x.mean()))

# Display the cleaned data to verify that missing values have been filled
data_cleaned.head()

Unnamed: 0,Level,T4,T3,T3adjusted,T4adjusted
0,5,8.1,2.1,2.008299,1.280579
1,5,8.7,4.029113,2.05671,
2,10,3.5,1.6,1.518294,1.169607
3,20,7.9,4.6,1.991632,1.663103
4,30,2.3,0.4,1.320006,0.736806


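We have replaced the missing values in T3 and T4 with the mean for their Level. For example, the missing T3 in row 1 (Level 5) is now 4.029113. T3adjusted and T4adjusted were not imputed, so their missing values remain.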
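
To make the arithmetic behind group-mean imputation concrete, here is a minimal sketch on made-up numbers. The frame `toy` and the column `T3_filled` are hypothetical. It uses `transform('mean')` with `fillna`, which should be equivalent to the lambda form used above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Level": [5, 5, 5, 10, 10, 20],
    "T3": [2.0, np.nan, 4.0, 1.0, np.nan, np.nan],
})

# Mean of the non-missing T3 values within each Level, broadcast back to every row.
level_means = toy.groupby("Level")["T3"].transform("mean")

# Level 5 -> (2.0 + 4.0) / 2 = 3.0, Level 10 -> 1.0, Level 20 has no data -> NaN
toy["T3_filled"] = toy["T3"].fillna(level_means)
print(toy)
```

Note the Level 20 row: when a group has no observed values at all, its mean is undefined and the gap stays as NaN.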

### Stage 4 - Descriptive Statistics
We will be calculating basic descriptive statistics for each column.

In [32]:
# Calculating Descriptive statistics manually

def descriptive_statistics(df):
    statistics = pd.DataFrame({
        'Sum': df.sum(),
        'Mean': df.mean(),
        'Median': df.median(),
        'Mode': df.mode().iloc[0],  # Mode might return multiple values, so we take the first one
        'StdDev': df.std(),
        'Variance': df.var(),
        'Range': df.max() - df.min(),
        'Min': df.min(),
        'Max': df.max(),
        'Count': df.count(),  # Number of non-missing values
        'Unique Values': df.nunique()  # Count of unique values
    })
    return statistics

We use a range of statistics to allow a more comprehensive analysis of the data set.
The statistics are as follows:
1. **Sum** - The total of all values in the column
2. **Mean** - The average of all values in the column
3. **Median** - The middle value in the sorted list of all values
4. **Mode** - The most frequently occurring value (the smallest one if several values tie)
5. **StdDev** - A measure of how spread out the values are around the mean. A higher StdDev means the data is more spread out from the average, and vice versa.
6. **Variance** - The square of the StdDev, i.e. the average squared deviation from the mean. pandas computes the sample version by default, dividing by n - 1.
7. **Range** - The difference between the maximum and the minimum value
8. **Min** - The smallest value in the column
9. **Max** - The largest value in the column
10. **Count** - The number of non-missing values in the column
11. **Unique Values** - The number of distinct values present in the column
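
The relationship between StdDev and Variance, and the way Mode handles ties, can be checked on a small made-up series. In this sketch the names `values` and `tied` are hypothetical. It also contrasts the pandas default (sample variance, `ddof=1`) with the NumPy default (population variance, `ddof=0`).

```python
import numpy as np
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# The mean is 5 and the squared deviations from it sum to 32.
sample_var = values.var()                   # pandas default ddof=1: 32 / 7
population_var = np.var(values.to_numpy())  # NumPy default ddof=0: 32 / 8

print(sample_var, values.std())  # std is the square root of the variance
print(population_var)

# mode() returns every tied value in sorted order, so .iloc[0] picks the smallest.
tied = pd.Series([3, 3, 1, 1, 2])
print(tied.mode().tolist())
print(tied.mode().iloc[0])
```

The two variance defaults differ, so results from pandas and NumPy only agree when `ddof` is set to the same value in both.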

In [36]:
# Let's compute the statistics for our data

manual_statistics = descriptive_statistics(data_cleaned)

manual_statistics

Unnamed: 0,Sum,Mean,Median,Mode,StdDev,Variance,Range,Min,Max,Count,Unique Values
Level,15605.0,32.375519,20.0,20.0,58.158773,3382.442892,395.0,5.0,400.0,482,14
T4,7402.319873,15.35751,11.241357,12.454915,19.757047,390.340903,279.6,-44.4,235.2,482,396
T3,6115.748987,12.688276,8.2,4.029113,14.923203,222.701975,186.8,-38.4,148.4,482,314
T3adjusted,962.530045,2.022122,2.217399,0.0,1.274283,1.623798,9.71377,-3.541014,6.172756,476,392
T4adjusted,701.413786,1.890603,2.0,0.0,1.24563,1.551593,8.668064,-3.373731,5.294334,371,306


### Stage 5 - Compare manual statistics to pandas.describe()
We will compare the manually calculated statistics with the ones generated automatically by the pandas.describe() function.

In [37]:

# Calculate descriptive statistics using pandas.describe()
pandas_stats = data_cleaned.describe()

# Display both the manually calculated stats and the pandas.describe() results
print("Manually Calculated Descriptive Statistics:")
print(manual_statistics)

print("\nDescriptive Statistics from pandas.describe():")
print(pandas_stats)


Manually Calculated Descriptive Statistics:
                     Sum       Mean     Median       Mode     StdDev  \
0                                                                      
Level       15605.000000  32.375519  20.000000  20.000000  58.158773   
T4           7402.319873  15.357510  11.241357  12.454915  19.757047   
T3           6115.748987  12.688276   8.200000   4.029113  14.923203   
T3adjusted    962.530045   2.022122   2.217399   0.000000   1.274283   
T4adjusted    701.413786   1.890603   2.000000   0.000000   1.245630   

               Variance       Range        Min         Max  Count  \
0                                                                   
Level       3382.442892  395.000000   5.000000  400.000000    482   
T4           390.340903  279.600000 -44.400000  235.200000    482   
T3           222.701975  186.800000 -38.400000  148.400000    482   
T3adjusted     1.623798    9.713770  -3.541014    6.172756    476   
T4adjusted     1.551593    8.668064  

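We have now compared our results with pandas.describe(). Note that describe() reports only count, mean, std, min, the 25%/50%/75% quartiles, and max, so only those statistics can be checked directly (the 50% row corresponds to our Median).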
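
Rather than comparing two printed tables by eye, the overlapping statistics can be checked programmatically. The following is a minimal sketch on a small hypothetical frame (`toy`). The names `manual`, `described`, and `matches` are illustrative and the values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "T3": [2.1, 1.6, 4.6, 0.4, np.nan],
    "T4": [8.1, 8.7, 3.5, 7.9, 2.3],
})

# Manually assembled statistics, labelled with the names describe() uses.
manual = pd.DataFrame({
    "count": toy.count(),
    "mean": toy.mean(),
    "std": toy.std(),
    "min": toy.min(),
    "50%": toy.median(),
    "max": toy.max(),
})

# describe() puts statistics in rows; transpose so each data column becomes a row.
described = toy.describe().T[manual.columns]

# Compare element-wise with a floating-point tolerance.
matches = np.isclose(manual.to_numpy(dtype=float), described.to_numpy(dtype=float))
print(pd.DataFrame(matches, index=manual.index, columns=manual.columns))
```

A table of `True` values indicates that the two sets of statistics agree within floating-point tolerance.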


### Stage 6 - Identify any repeated rows
Let's write logic to identify any repeated rows, or to confirm that there are none.

In [38]:
# Check for duplicates
duplicates = data_cleaned[data_cleaned.duplicated()]

# Display duplicates (if any)
duplicates

Unnamed: 0,Level,T4,T3,T3adjusted,T4adjusted
120,20,14.7,17.169391,2.44966,
206,40,18.416388,20.410854,,
252,5,18.4,12.1,2.640012,2.29577
274,10,19.8,11.4,2.705339,2.250617
334,15,31.0,13.193684,3.141381,
359,20,19.5,15.6,2.691606,2.498666
466,200,12.0,17.783333,2.289428,
467,200,16.3,37.3,2.535494,3.341204
468,100,27.8,42.3,3.029342,3.484283
469,200,12.9,9.0,2.34529,2.080084
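
The check above lists the repeated rows but does not print an explicit confirmation either way. A hedged sketch of how that confirmation might look is given below on a small hypothetical frame (`toy`). Dropping the duplicates at the end is an assumption about a possible next step, not something this notebook does.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Level": [20, 5, 20, 5, 10],
    "T4": [14.7, 18.4, 14.7, 18.4, 19.8],
    "T4adjusted": [np.nan, 2.29577, np.nan, 2.29577, 2.250617],
})

n_duplicates = int(toy.duplicated().sum())  # rows that repeat an earlier row
if n_duplicates == 0:
    print("No repeated rows found.")
else:
    print(f"{n_duplicates} repeated row(s) found.")
    # keep=False marks every copy, so each repeat is shown next to its original.
    print(toy[toy.duplicated(keep=False)].sort_values(list(toy.columns)))

# Possible follow-up: keep only the first occurrence of each row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
```

Missing values in the same position are treated as equal by `duplicated()`, which is why rows containing NaN can still be flagged as repeats.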


## Conclusion
In this notebook, we:
- Cleaned and transformed a tab-separated dataset.
- Replaced missing values with group-level averages.
- Calculated descriptive statistics manually.
- Compared our manually generated statistics with those from pandas.describe().
- Checked for duplicate rows in the data, which flagged 10 repeated rows.

