# Data Cleaning and Initial Analysis

The first stage is to clean the data and make sure it is fit for purpose. Examine the data carefully to identify anomalies, then consider how your program can detect them and correct, change, or delete the affected values. You will need to decide how to handle erroneous or missing values. Output a sample of the data that demonstrates how cleaning has changed it.
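
As a minimal sketch of this stage, the snippet below cleans a small made-up DataFrame and prints a before/after sample. The column names, the values, and the decision to drop rows with a missing `component` are assumptions for illustration only; your own data may call for filling or flagging missing values instead.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data containing typical anomalies
raw = pd.DataFrame({
    "user": ["alice", " Bob ", "CAROL", "dave"],
    "component": ["Quiz", "   ", "quiz", None],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: blanks become NaN, text is stripped and lower-cased."""
    out = df.copy()
    for col in out.columns:
        # Empty or whitespace-only strings are treated as missing
        out[col] = out[col].replace(r"^\s*$", np.nan, regex=True)
        out[col] = out[col].str.strip().str.lower()
    # Assumption: rows without a component cannot be analysed, so drop them
    return out.dropna(subset=["component"])

cleaned = clean(raw)
print("Before cleaning:")
print(raw)
print("After cleaning:")
print(cleaned)
print(f"Rows removed: {len(raw) - len(cleaned)}")
```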

The next stage is to reshape the data as per any requirements of the brief (or your own scenario). Is all the data needed to provide the required results? Is any of it duplicated? Is there data across different sources that needs to be brought together? Again, output a sample to the console to demonstrate how this has changed the structure of the data.
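
One possible shape for this step is sketched below: drop a column the analysis does not need, remove duplicate rows, and join a second source. The two DataFrames, their column names, and the `reshape` helper are hypothetical and stand in for whatever sources your brief provides.

```python
import pandas as pd

# Hypothetical sources: an activity log and a lookup table of component codes
activity = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "code": ["Q1", "Q1", "A1", "Q1"],
    "debug_note": ["x", "x", "y", "z"],  # assumed not needed for the analysis
})
codes = pd.DataFrame({"code": ["Q1", "A1"], "component": ["quiz", "assignment"]})

def reshape(log: pd.DataFrame, lookup: pd.DataFrame) -> pd.DataFrame:
    """Drop unneeded columns and duplicate rows, then attach component names."""
    slim = log.drop(columns=["debug_note"]).drop_duplicates()
    # A left join keeps every log row even if its code has no lookup entry
    return slim.merge(lookup, on="code", how="left")

reshaped = reshape(activity, codes)
print(f"Shape before: {activity.shape}, after: {reshaped.shape}")
print(reshaped)
```

Printing the shape before and after gives a quick, checkable statement of how the structure changed.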

Finally, develop and test a set of functions (or objects and methods) that apply the statistical analysis to the data set and output the results to the console.
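
The brief does not say which statistics are required, so the following is only a sketch: a hypothetical `summarise` function computing a few common descriptive statistics, with a small test against values worked out by hand.

```python
import pandas as pd

def summarise(values: pd.Series) -> dict:
    """Return basic descriptive statistics for a numeric Series."""
    return {
        "count": int(values.count()),
        "mean": float(values.mean()),
        "median": float(values.median()),
        "std": float(values.std()),  # sample standard deviation (ddof=1)
    }

def test_summarise() -> None:
    """Check the function against values worked out by hand."""
    stats = summarise(pd.Series([2, 4, 4, 6]))
    assert stats["count"] == 4
    assert stats["mean"] == 4.0
    assert stats["median"] == 4.0
    print("summarise() passed its hand-checked test")

test_summarise()
for name, value in summarise(pd.Series([10, 12, 15, 20, 23])).items():
    print(f"{name}: {value:.2f}")
```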

Capture the results of your data cleaning, shaping, and functions with screenshots of the console. There is no requirement at this stage for anything to function through the GUI. Make sure it is clear what your output is testing or demonstrating by printing simple, informative statements.
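
To keep screenshots self-explanatory, each printed sample can carry a short title saying what it demonstrates. The `report` helper below is a hypothetical example of that idea, not a required interface.

```python
import pandas as pd

def report(title: str, df: pd.DataFrame, rows: int = 3) -> str:
    """Print a titled sample of a DataFrame and return the printed text."""
    shown = min(rows, len(df))
    text = f"--- {title} (showing {shown} of {len(df)} rows) ---\n{df.head(rows).to_string()}"
    print(text)
    return text

# Made-up data purely to show the output format
sample = pd.DataFrame({"user": ["alice", "bob"], "score": [71, 64]})
report("After cleaning: whitespace stripped, case normalised", sample)
```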

In [205]:
import pandas as pd
import numpy as np
import pymongo
from rapidfuzz import process, fuzz

class CleaningFunctions:
    '''Class containing related data cleaning functions'''
    @staticmethod
    def _map_to_most_similar(values, counts, threshold=80):
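        """
        Map each value to the most frequent one among its close matches.

        `values` and `counts` must be aligned: counts[i] is the frequency of values[i].
        """
        # Build the frequency lookup once instead of searching the list for every match
        count_of = dict(zip(values, counts))
        result_mapping = {}
        for value in values:
            matches = process.extract(value, values, scorer=fuzz.ratio, limit=None)
            # Keep matches at or above the threshold, paired with their frequency
            similar = [(match, count_of[match]) for match, score, _ in matches if score >= threshold]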
            if similar:
                # Find the most frequent one
                most_frequent = max(similar, key=lambda x: x[1])[0]
                result_mapping[value] = most_frequent
            else:
                # No similar match; keep the value itself
                result_mapping[value] = value
        return result_mapping

    @staticmethod
    def tidy_col_values(column: pd.Series, whitespace_to_nan=True, strip_whitespace=True,
                        normalise_case: str | None = 'title', fuzzy_threshold=80):
        '''Normalise the string values of a column and merge near-duplicate spellings'''
        cases = ('title', 'lower', 'upper')

        if whitespace_to_nan:
            # Treat empty or whitespace-only strings as missing values
            column = column.replace(r'^\s*$', np.nan, regex=True)

        if strip_whitespace:
            # Strip leading and trailing whitespace from values
            column = column.str.strip()

        if normalise_case:
            if normalise_case not in cases:
                raise KeyError(f"'{normalise_case}' not accepted value for normalise_case. Please choose from {list(cases)}")

            # Normalise case on the current column. Looking the method up here,
            # rather than before the steps above, keeps the earlier cleaning.
            column = getattr(column.str, normalise_case)()

        # Fuzzy string matching to merge similar value names (e.g. Submission_state and Submission_status)
        # value_counts() drops NaN and keeps every value aligned with its own count
        value_counts = column.value_counts()
        column = column.replace(CleaningFunctions._map_to_most_similar(value_counts.index.tolist(),
                                                    value_counts.tolist(),
                                                    threshold=fuzzy_threshold))

        return column
    
    @staticmethod
    def reshaping(obj: pd.DataFrame | pd.Series, reshape: tuple[int, int] | None = None, transpose=False):
        '''Placeholder for reshaping logic (not yet implemented)'''
        pass



client = pymongo.MongoClient('localhost', 27017)
db = client['summative']

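# Note: this unpacking relies on list_collection_names() returning exactly three
# collections in this order; the order is not documented as guaranteed
act_log, comp_codes, user_log = [db[name] for name in db.list_collection_names()]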
act_log_df = pd.DataFrame(act_log.find({})).set_index('_id')

In [204]:
temp_df = act_log_df.copy()
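# Clean every column except the first, working on a copy so the original stays available for comparison
for col in act_log_df.columns[1:]: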
    temp_df[col] = CleaningFunctions.tidy_col_values(temp_df[col], normalise_case='lower', fuzzy_threshold=70)

(4, 4)
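
The `fuzzy_threshold` argument used above controls how similar two values must be before they are merged into the more frequent spelling. The sketch below illustrates that effect using the standard library's `difflib.SequenceMatcher` as a stand-in for `rapidfuzz.fuzz.ratio`; the two scorers are not guaranteed to return identical scores, and the sample values and counts are made up.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (stand-in for rapidfuzz's fuzz.ratio)."""
    return 100 * SequenceMatcher(None, a, b).ratio()

def map_to_most_frequent(counts: dict[str, int], threshold: float = 80) -> dict[str, str]:
    """Map each value to the most frequent value scoring at or above the threshold."""
    mapping = {}
    for value in counts:
        similar = [other for other in counts if similarity(value, other) >= threshold]
        # `similar` always contains `value` itself, which scores 100
        mapping[value] = max(similar, key=counts.get)
    return mapping

# Hypothetical value frequencies from one column
counts = {"submission_status": 40, "submission_state": 3, "login": 25}
for threshold in (80, 95):
    print(f"threshold={threshold}: {map_to_most_frequent(counts, threshold)}")
```

A lower threshold merges more aggressively, so it is worth printing the resulting mapping and checking that genuinely distinct values have not been collapsed.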