# Project Group 3 - Mushroom Dataset
Team Members:

1. Uday Kiran Lakkineni 

2. Srimanth Madira 

3. Prathyusha Mekala

4. Mrunmay Sandeep 

5. Laxman Yadav Musti

## Import Statements

In [3]:
import random

import mysql.connector  # needed by populate_synthetic_data_table further below
import numpy as np
import pandas as pd
from ucimlrepo import fetch_ucirepo

In [14]:
# Fetch the Mushroom dataset (UCI repository id 73)
mushroom = fetch_ucirepo(id=73)

# Features and target as separate pandas DataFrames
X = mushroom.data.features
y = mushroom.data.targets

# Direct CSV URL for the same dataset
url = "https://archive.ics.uci.edu/static/public/73/data.csv"

# Read the CSV file into one DataFrame (features and target together)
df = pd.read_csv(url)

## Data Exploration Task 1:

The mushroom dataset, derived from the Audubon Society Field Guide, aims to classify various gilled mushrooms into edible or 
poisonous categories. Each feature captures specific physical characteristics crucial for this classification task. 
Understanding the interplay of these features is essential for building an accurate model to distinguish between edible and 
poisonous mushrooms.

Explanation of the purpose of each feature in relation to all other features:

1) Cap-Shape: The "cap-shape" feature characterizes the overall silhouette of the mushroom cap, encompassing distinctions like bell (b), conical (c), convex (x), flat (f), knobbed (k), and sunken (s) caps. This feature holds significance in understanding various mushroom characteristics, influencing factors such as gill attachment and stalk shape, as specific cap shapes may correlate with distinct features throughout the dataset.

2) Cap-Surface: The "cap-surface" feature classifies the mushroom cap texture into fibrous (f), grooves (g), scaly (y), and smooth (s). It plays a crucial role in understanding surface characteristics, potentially influencing the likelihood of bruises and showing associations with color or odor. In summary, "cap-surface" is a key descriptor contributing to a comprehensive understanding of the mushroom's physical attributes and their interconnections with other features in the dataset.

3) Cap-Color: The "cap-color" feature specifies the hue of the mushroom cap, offering choices such as brown (n), buff (b), cinnamon (c), gray (g), green (r), pink (p), purple (u), red (e), white (w), and yellow (y). This feature serves as a vital marker, influencing characteristics like gill and stalk color. Essentially, it plays a key role in differentiating and comprehending diverse aspects of the mushroom's visual profile.

4) Bruises: The "bruises" feature is binary, signifying the presence (t) or absence (f) of bruises on the mushroom. This attribute, indicative of the mushroom's overall health, might be correlated with other features such as cap color or odor. In essence, "bruises" provides insights into potential connections between external characteristics and the well-being of the mushroom.

5) Odor: The "odor" feature characterizes the scent of the mushroom, with choices like almond (a), anise (l), creosote (c), fishy (y), foul (f), musty (m), none (n), pungent (p), and spicy (s). This attribute holds crucial significance, as specific odors can indicate the mushroom's edibility or toxicity. The smell, being a pivotal feature, strongly influences the classification of the mushroom.

6) Gill-Attachment: The "gill-attachment" feature defines how the gills are connected to the stem, with options including attached (a), descending (d), free (f), and notched (n). This attribute not only provides information on gill attachment but also holds potential associations with features like gill spacing and gill size. Such associations can impact the overall visual characteristics of the mushroom, making "gill-attachment" a significant descriptor in understanding its morphology.

7) Gill-Spacing: The "gill-spacing" feature characterizes the spacing between gills, with distinctions like close (c), crowded (w), and distant (d). This attribute not only provides direct information about gill arrangement but also suggests potential relationships with gill attachment and gill size. Consequently, "gill-spacing" contributes valuable insights into the structural characteristics of the mushroom's gills.

8) Gill-Size: The "gill-size" feature denotes whether the gills are broad (b) or narrow (n). This characteristic, influenced by gill spacing, may impact the mushroom's overall appearance, particularly its underside. In essence, "gill-size" contributes to understanding the visual attributes of the mushroom, providing insights into the dimensions and structure of its gills.

9) Gill-Color: The "gill-color" feature designates the color of the gills, with choices including black (k), brown (n), buff (b), chocolate (h), gray (g), green (r), orange (o), pink (p), purple (u), red (e), white (w), and yellow (y). This attribute holds significance as gill color may be related to cap color, influencing the overall visual characteristics of the mushroom. In summary, "gill-color" provides valuable information for understanding the aesthetic features and potential color correlations within the dataset.
    
10) Stalk-Shape: The "stalk-shape" feature characterizes the shape of the mushroom stalk, indicating whether it is enlarging (e) or tapering (t). This attribute, influenced by cap shape, contributes insights into the overall structure of the mushroom. In essence, "stalk-shape" serves as a descriptor that aids in understanding the morphological characteristics of the mushroom's stalk in relation to other features.
    
11) Stalk-Root: The "stalk-root" feature furnishes details about the root of the mushroom stalk, encompassing options such as bulbous (b), club (c), cup (u), equal (e), rhizomorphs (z), rooted (r), and missing (?). This attribute holds importance as the stalk root may exert influence on other features like stalk surface, offering crucial insights into the mushroom's overall growth pattern. Consequently, "stalk-root" plays a pivotal role in comprehending the structural aspects of the mushroom and their potential impact on its characteristics.
    
12) Stalk-Surface-Above-Ring: The "stalk-surface-above-ring" feature characterizes the surface of the mushroom stalk above the ring, distinguishing between fibrous (f), scaly (y), silky (k), and smooth (s). This attribute's significance lies in its potential linkage to other features, such as stalk color, and its contribution to the overall appearance of the mushroom. In essence, "stalk-surface-above-ring" provides valuable insights into the textural qualities of the stalk, influencing its visual attributes within the dataset.
    
13) Stalk-Surface-Below-Ring: The "stalk-surface-below-ring" feature, akin to its above-ring counterpart, characterizes the surface of the mushroom stalk but focuses on the area below the ring. It differentiates between fibrous (f), scaly (y), silky (k), and smooth (s) surfaces. This feature complements the "stalk-surface-above-ring," offering a comprehensive description of the entire stalk. Together, they provide nuanced insights into the textural qualities of both upper and lower stalk regions, contributing to a holistic understanding of the mushroom's stalk structure.
    
14) Stalk-Color-Above-Ring: The "stalk-color-above-ring" feature designates the color of the mushroom stalk above the ring, with choices such as brown (n), buff (b), cinnamon (c), gray (g), orange (o), pink (p), red (e), white (w), and yellow (y). This attribute holds significance as stalk color may be associated with cap color, influencing the overall visual impression of the mushroom. In essence, "stalk-color-above-ring" provides valuable information for understanding the aesthetic characteristics and potential color correlations within the dataset.
    
15) Stalk-Color-Below-Ring: The "stalk-color-below-ring" feature, similar to its above-ring counterpart, specifies the color of the mushroom stalk but focuses on the area below the ring. It includes choices such as brown (n), buff (b), cinnamon (c), gray (g), orange (o), pink (p), red (e), white (w), and yellow (y). Complementing the "stalk-color-above-ring," this feature contributes to the comprehensive description of the entire stalk, offering insights into the color variations throughout different sections of the mushroom's stalk.
    
16) Veil-Type: The "veil-type" feature is binary by definition, indicating the type of veil as either partial (p) or universal (u). In this dataset, however, every record has the value partial (p), so the feature has no variability and cannot by itself help separate edible from poisonous mushrooms. It is retained here for completeness alongside the related "veil-color" feature.
    
17) Veil-Color: The "veil-color" feature designates the color of the mushroom veil, offering choices like brown (n), orange (o), white (w), and yellow (y). This attribute holds significance, as veil color may be linked to cap color, contributing to the overall visual characteristics of the mushroom. In essence, "veil-color" provides valuable information for understanding the aesthetic features and potential color correlations within the dataset.
    
18) Ring-Number: The "ring-number" feature indicates the number of rings on the mushroom, with options none (n), one (o), and two (t). This attribute's relevance extends to its potential relationship with other features, such as ring type, offering insights into the mushroom's reproductive structures. In essence, "ring-number" contributes valuable information for understanding the reproductive characteristics of the mushroom within the dataset.
    
19) Ring-Type: The "ring-type" feature describes the type of ring on the mushroom, providing choices like cobwebby (c), evanescent (e), flaring (f), large (l), none (n), pendant (p), sheathing (s), and zone (z). This attribute's significance lies in its potential influence by features such as cap shape, potentially contributing to the overall appearance of the mushroom. In summary, "ring-type" offers insights into the specific characteristics of the ring, contributing to a comprehensive understanding of the mushroom's visual features within the dataset.
    
20) Spore-Print-Color: The "spore-print-color" feature specifies the color of the mushroom's spore print, presenting choices such as black (k), brown (n), buff (b), chocolate (h), green (r), orange (o), purple (u), white (w), and yellow (y). This attribute's relevance extends to its potential correlation with gill color, providing additional information about the mushroom's reproductive features. In essence, "spore-print-color" contributes valuable insights into the reproductive characteristics of the mushroom within the dataset.
    
21) Population: The "population" feature describes the population of mushrooms, with options such as abundant (a), clustered (c), numerous (n), scattered (s), several (v), and solitary (y). This attribute's significance extends to its potential influence by habitat, offering insights into the ecological distribution of the mushroom. In essence, "population" contributes valuable information for understanding the mushroom's presence and dispersion within different ecological settings in the dataset.
    
22) Habitat: The "habitat" feature indicates the habitat in which the mushroom is found, encompassing choices like grasses (g), leaves (l), meadows (m), paths (p), urban (u), waste (w), and woods (d). This attribute's significance lies in its potential influence on other features, such as population, making it crucial for understanding the mushroom's ecological niche. In summary, "habitat" provides valuable information about the environments in which the mushroom thrives and contributes to a holistic understanding of its ecological distribution within the dataset.
    
    
Target Variable:

poisonous: This is the target variable, indicating whether the mushroom is poisonous (p) or edible (e).

This dataset is valuable for training machine learning models to predict whether a mushroom is safe to eat or not based on its physical characteristics. It represents a classic example of a categorical classification problem, where the goal is to categorize mushrooms into two classes: edible or poisonous.
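
As a rough illustration of how a single feature such as odor can be checked against the target, the sketch below cross-tabulates the two on a small hand-made frame. The toy rows are assumptions for illustration only and are not taken from the dataset; on the real data the same call would be applied to the `odor` and `poisonous` columns of `df`.

```python
import pandas as pd

# Toy rows for illustration only; not drawn from the real dataset.
toy = pd.DataFrame({
    "odor":      ["n", "n", "a", "f", "f", "p"],
    "poisonous": ["e", "e", "e", "p", "p", "p"],
})

# Row-normalised contingency table: share of each class within every odor value.
odor_vs_class = pd.crosstab(toy["odor"], toy["poisonous"], normalize="index")
print(odor_vs_class)
```

A row whose share is close to 1.0 for one class suggests that the odor value is strongly associated with that class.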

## Data Exploration Task 2 - Imputation Function

In [15]:
def mode_imputation(data, column_name):
    # Normalise to stripped strings; astype(str) can turn NaN into the text 'nan'
    data[column_name] = data[column_name].astype(str)
    data[column_name] = data[column_name].str.strip()

    # Treat both 'nan' and the '?' placeholder as missing before taking the mode,
    # so that a placeholder can never be picked as the fill value
    data[column_name] = data[column_name].replace(['nan', '?'], np.nan)

    mode_value = data[column_name].mode()[0]
    data[column_name] = data[column_name].fillna(mode_value)

    return data

df = mode_imputation(df, 'stalk-root')
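
A minimal sketch of the same idea on a toy column, assuming missing entries appear either as `NaN` or as the placeholder `?`. The helper name `impute_mode` and its `placeholders` parameter are hypothetical; unlike the function above, it returns a new Series instead of modifying the DataFrame in place.

```python
import numpy as np
import pandas as pd

def impute_mode(series, placeholders=("?",)):
    """Return a copy of `series` with NaN and placeholder entries set to the mode."""
    cleaned = series.replace(list(placeholders), np.nan)
    # mode() ignores NaN, so a placeholder can never win the vote.
    mode_value = cleaned.mode().iloc[0]
    return cleaned.fillna(mode_value)

# Toy values for illustration only.
toy = pd.Series(["b", "?", "b", np.nan, "e"], name="stalk-root")
filled = impute_mode(toy)
print(filled.tolist())
```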

## Data Exploration Task 3 - Descriptive Statistics 

In [16]:
def calculate_descriptive_statistics(data, column_name):
    encoding_mapping = {value: index + 1 for index, value in enumerate(sorted(data[column_name].unique()))}

    encoded_column = data[column_name].map(encoding_mapping)

    min_value = encoded_column.min()
    max_value = encoded_column.max()
    mean_value = encoded_column.mean()
    median_value = encoded_column.median()
    mode_value = encoded_column.mode().iloc[0]
    
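    # Look up the category label whose code equals the mode
    # (instead of hard-coding the code 6)
    for key, value in encoding_mapping.items():
        if value == mode_value:
            odor_value = key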
            
    range_value = max_value - min_value
    variance_value = encoded_column.var()
    std_dev_value = encoded_column.std()
    q1_value = encoded_column.quantile(0.25)
    q2_value = encoded_column.quantile(0.5)
    q3_value = encoded_column.quantile(0.75)

    print("Descriptive Statistics of the most relevant feature:")
    print(f"\nMost Relevant Feature: {column_name}")
    print(f"Minimum: {min_value}")
    print(f"Maximum: {max_value}")
    print(f"Mean: {mean_value:.2f}")
    print(f"Median: {median_value:.2f}")
    print(f"Mode: {mode_value} Odor_Value: {odor_value}")
    print(f"Range: {range_value}")
    print(f"Variance: {variance_value:.2f}")
    print(f"Standard Deviation: {std_dev_value:.2f}")
    print(f"1st Quartile: {q1_value}")
    print(f"2nd Quartile (Median): {q2_value}")
    print(f"3rd Quartile: {q3_value}")

calculate_descriptive_statistics(df, 'odor')

Descriptive Statistics of the most relevant feature:

Most Relevant Feature: odor
Minimum: 1
Maximum: 9
Mean: 5.14
Median: 6.00
Mode: 6 Odor_Value: n
Range: 8
Variance: 4.43
Standard Deviation: 2.10
1st Quartile: 3.0
2nd Quartile (Median): 6.0
3rd Quartile: 6.0
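
Because `odor` is a nominal feature, the integer codes are arbitrary, and statistics such as the mean depend on how the labels happen to be ordered. The sketch below, on toy data, contrasts an alphabetical encoding with a frequency-rank encoding; only the mode label stays the same. The toy values are assumptions for illustration.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.Series(["n", "n", "n", "f", "f", "a"])

# Encoding 1: alphabetical order of labels (a=1, f=2, n=3).
alpha_codes = {label: i + 1 for i, label in enumerate(sorted(toy.unique()))}
# Encoding 2: rank by descending frequency (n=1, f=2, a=3).
freq_codes = {label: i + 1 for i, label in enumerate(toy.value_counts().index)}

mean_alpha = toy.map(alpha_codes).mean()
mean_freq = toy.map(freq_codes).mean()
mode_label = toy.mode().iloc[0]

print(f"alphabetical mean: {mean_alpha:.2f}, frequency-rank mean: {mean_freq:.2f}, mode: {mode_label}")
```

For this reason the mode and the frequency counts are the most interpretable of the statistics for a categorical feature.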


## Data Exploration Task 4 - Frequency Distribution

In [17]:
def frequency_distribution(data, column_name):
    value_counts = data[column_name].value_counts()

    unique_values = value_counts.index.tolist()
    encoding_mapping = {value: index + 1 for index, value in enumerate(unique_values)}

    value_counts_df = pd.DataFrame(value_counts).reset_index()
    value_counts_df.columns = ['Value', 'Frequency']
    value_counts_df['Encoded'] = value_counts_df['Value'].map(encoding_mapping)
    value_counts_df['Relative Frequency (%)'] = (value_counts_df['Frequency'] / value_counts_df['Frequency'].sum()) * 100

    print(f"Frequency Distribution of the Most Relevant Feature '{column_name}':\n")
    print(value_counts_df[['Encoded', 'Value', 'Frequency', 'Relative Frequency (%)']].to_string(index=False))

frequency_distribution(df, 'odor')

Frequency Distribution of the Most Relevant Feature 'odor':

 Encoded Value  Frequency  Relative Frequency (%)
       1     n       3528               43.426883
       2     f       2160               26.587888
       3     y        576                7.090103
       4     s        576                7.090103
       5     a        400                4.923683
       6     l        400                4.923683
       7     p        256                3.151157
       8     c        192                2.363368
       9     m         36                0.443131


## Data Exploration Task 5 - Frequency Distribution Mean

In [18]:
def calculate_frequency_distribution_mean(data, column_name):
    
    unique_values = sorted(data[column_name].unique())
    encoding_mapping = {value: index + 1 for index, value in enumerate(unique_values)}
    value_counts = data[column_name].value_counts()
    sum_of_products = sum(encoding_mapping[value] * frequency for value, frequency in value_counts.items())
    total_instances = len(data)
    mean_value = sum_of_products / total_instances
    
    print(f"Mean of the frequency distribution of '{column_name}': {mean_value:.4f}")

calculate_frequency_distribution_mean(df, 'odor')

Mean of the frequency distribution of 'odor': 5.1448
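
The reported mean can be checked by hand from the odor frequencies listed in Task 4, assuming the alphabetical encoding that the function builds with `enumerate(sorted(...))` (a=1, c=2, f=3, l=4, m=5, n=6, p=7, s=8, y=9).

```python
# Frequencies copied from the frequency distribution of 'odor' in Task 4.
frequencies = {"n": 3528, "f": 2160, "y": 576, "s": 576, "a": 400,
               "l": 400, "p": 256, "c": 192, "m": 36}

# Alphabetical encoding, matching enumerate(sorted(...)) in the function.
codes = {label: i + 1 for i, label in enumerate(sorted(frequencies))}

total = sum(frequencies.values())
weighted_sum = sum(codes[label] * count for label, count in frequencies.items())
mean_value = weighted_sum / total

print(f"{weighted_sum} / {total} = {mean_value:.4f}")  # 41796 / 8124 = 5.1448
```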


## Data Shaping Task 1 - Methodology & Code

The methodology for generating synthetic mushroom data mirrors the structure of the original dataset through a set of feature generation methods. Each method randomly selects values for one mushroom characteristic, such as cap shape, color, or odor, from a predefined list of category codes. The generate_mushroom_data method orchestrates this process: it creates an empty DataFrame with the original column names and fills each column by calling the corresponding feature method once for all instances.

The resulting DataFrame, containing 2000 rows of synthetic mushroom samples, is then persisted; in the code below, populate_synthetic_data_table inserts it into a MySQL table. Because random.choices draws each value uniformly and independently, the synthetic data reproduces the column layout and value domains of the original dataset, but not its category frequencies or the correlations between features.
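
The per-feature generation pattern described above can also be sketched with a single mapping from feature name to allowed values. This is a condensed illustration rather than the class used in this notebook: the names `FEATURE_VALUES` and `generate_synthetic` and the `seed` parameter are assumptions, and only three features are listed.

```python
import random

import pandas as pd

# Subset of the feature domains, for illustration.
FEATURE_VALUES = {
    "cap-shape": ["x", "b", "s", "f", "k", "c"],
    "bruises": ["t", "f"],
    "poisonous": ["p", "e"],
}

def generate_synthetic(num_instances, seed=None):
    """Draw every column uniformly and independently from its value list."""
    rng = random.Random(seed)  # local generator so a seed gives repeatable output
    columns = {
        name: rng.choices(values, k=num_instances)
        for name, values in FEATURE_VALUES.items()
    }
    return pd.DataFrame(columns)

synthetic = generate_synthetic(5, seed=0)
print(synthetic)
```

Passing a `weights` argument to `choices` would be one way to follow observed category frequencies instead of a uniform draw.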

In [10]:
class syntheticDataGenerator:
    
    def __init__(self, num_instances):
        
        self.num_instances = num_instances
    

    def generate_cap_shape_instances(self, num_instances):
        cap_shape_values = ['x', 'b', 's', 'f', 'k', 'c']
        cap_shape_instances = random.choices(cap_shape_values, k=num_instances)
    
        return cap_shape_instances

    def generate_cap_surface_instances(self, num_instances):
        cap_surface_values = ['s', 'y', 'f', 'g']
        cap_surface_instances = random.choices(cap_surface_values, k=num_instances)
    
        return cap_surface_instances

    def generate_cap_color_instances(self, num_instances):
        cap_color_values = ['n', 'y', 'w', 'g', 'e', 'p', 'b', 'u', 'c', 'r']
        cap_color_instances = random.choices(cap_color_values, k=num_instances)
    
        return cap_color_instances

    def generate_bruises_instances(self, num_instances):
        bruises_values = ['t', 'f']
        bruises_instances = random.choices(bruises_values, k=num_instances)
    
        return bruises_instances

    def generate_odor_instances(self, num_instances):
        odor_values = ['p', 'a', 'l', 'n', 'f', 'c', 'y', 's', 'm']
        odor_instances = random.choices(odor_values, k=num_instances)
    
        return odor_instances

    def generate_gill_attachment_instances(self, num_instances):
        gill_attachment_values = ['f', 'a', 'd', 'n']
        gill_attachment_instances = random.choices(gill_attachment_values, k=num_instances)
    
        return gill_attachment_instances

    def generate_gill_spacing_instances(self, num_instances):
    
        gill_spacing_values = ['c', 'w', 'd']
        gill_spacing_instances = random.choices(gill_spacing_values, k=num_instances)
    
        return gill_spacing_instances

    def generate_gill_size_instances(self, num_instances):
        gill_size_values = ['n', 'b']
        gill_size_instances = random.choices(gill_size_values, k=num_instances)
    
        return gill_size_instances

    def generate_gill_color_instances(self, num_instances):
        gill_color_values = ['k', 'n', 'g', 'p', 'w', 'h', 'u', 'e', 'b', 'r', 'y', 'o']
        gill_color_instances = random.choices(gill_color_values, k=num_instances)
    
        return gill_color_instances

    def generate_stalk_shape_instances(self, num_instances):
        stalk_shape_values = ['e', 't']
        stalk_shape_instances = random.choices(stalk_shape_values, k=num_instances)
    
        return stalk_shape_instances

    def generate_stalk_root_instances(self, num_instances):
        stalk_root_values = ['e', 'c', 'b', 'r', 'z', 'u']
        stalk_root_instances = random.choices(stalk_root_values, k=num_instances)
    
        return stalk_root_instances

    def generate_stalk_surface_above_ring_instances(self, num_instances):
        stalk_surface_above_ring_values = ['s', 'f', 'k', 'y']
        stalk_surface_above_ring_instances = random.choices(stalk_surface_above_ring_values, k=num_instances)
    
        return stalk_surface_above_ring_instances

    def generate_stalk_surface_below_ring_instances(self, num_instances):
        stalk_surface_below_ring_values = ['s', 'f', 'y', 'k']
        stalk_surface_below_ring_instances = random.choices(stalk_surface_below_ring_values, k=num_instances)
    
        return stalk_surface_below_ring_instances

    def generate_stalk_color_above_ring_instances(self, num_instances):
        stalk_color_above_ring_values = ['w', 'g', 'p', 'n', 'b', 'e', 'o', 'c', 'y']
        stalk_color_above_ring_instances = random.choices(stalk_color_above_ring_values, k=num_instances)
    
        return stalk_color_above_ring_instances

    def generate_stalk_color_below_ring_instances(self, num_instances):
        stalk_color_below_ring_values = ['w', 'p', 'g', 'b', 'n', 'e', 'y', 'o', 'c'] 
        stalk_color_below_ring_instances = random.choices(stalk_color_below_ring_values, k=num_instances)
    
        return stalk_color_below_ring_instances

    def generate_veil_type_instances(self, num_instances):
        veil_type_values = ['p', 'u']
        veil_type_instances = random.choices(veil_type_values, k=num_instances)
    
        return veil_type_instances

    def generate_veil_color_instances(self, num_instances):
        veil_color_values = ['w', 'n', 'o', 'y']
        veil_color_instances = random.choices(veil_color_values, k=num_instances)
    
        return veil_color_instances

    def generate_ring_number_instances(self, num_instances):
        ring_number_values = ['o', 't', 'n']
        ring_number_instances = random.choices(ring_number_values, k=num_instances)
    
        return ring_number_instances

    def generate_ring_type_instances(self, num_instances):
        ring_type_values = ['p', 'e', 'l', 'f', 'n']
        ring_type_instances = random.choices(ring_type_values, k=num_instances)
    
        return ring_type_instances

    def generate_spore_print_color_instances(self, num_instances):
        spore_print_color_values = ['k', 'n', 'u', 'h', 'w', 'r', 'o', 'y', 'b']
        spore_print_color_instances = random.choices(spore_print_color_values, k=num_instances)
    
        return spore_print_color_instances

    def generate_population_instances(self, num_instances):
        population_values = ['s', 'n', 'a', 'v', 'y', 'c']
        population_instances = random.choices(population_values, k=num_instances)
    
        return population_instances

    def generate_habitat_instances(self, num_instances):
        habitat_values = ['u', 'g', 'm', 'd', 'p', 'w', 'l']
        habitat_instances = random.choices(habitat_values, k=num_instances)
    
        return habitat_instances

    def generate_poisonous_instances(self, num_instances):
        poisonous_values = ['p', 'e']
        poisonous_instances = random.choices(poisonous_values, k=num_instances)
    
        return poisonous_instances


    def generate_mushroom_data(self):
        
        num_instances = self.num_instances
        
        columns = ['cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', 'gill-attachment',
               'gill-spacing', 'gill-size', 'gill-color', 'stalk-shape', 'stalk-root',
               'stalk-surface-above-ring', 'stalk-surface-below-ring', 'stalk-color-above-ring',
               'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', 'ring-type',
               'spore-print-color', 'population', 'habitat', 'poisonous']

        mushroom_data = pd.DataFrame(columns=columns)

        mushroom_data['cap-shape'] = self.generate_cap_shape_instances(num_instances)
        mushroom_data['cap-surface'] = self.generate_cap_surface_instances(num_instances)
        mushroom_data['cap-color'] = self.generate_cap_color_instances(num_instances)
        mushroom_data['bruises'] = self.generate_bruises_instances(num_instances)
        mushroom_data['odor'] = self.generate_odor_instances(num_instances)
        mushroom_data['gill-attachment'] = self.generate_gill_attachment_instances(num_instances)
        mushroom_data['gill-spacing'] = self.generate_gill_spacing_instances(num_instances)
        mushroom_data['gill-size'] = self.generate_gill_size_instances(num_instances)
        mushroom_data['gill-color'] = self.generate_gill_color_instances(num_instances)
        mushroom_data['stalk-shape'] = self.generate_stalk_shape_instances(num_instances)
        mushroom_data['stalk-root'] = self.generate_stalk_root_instances(num_instances)
        mushroom_data['stalk-surface-above-ring'] = self.generate_stalk_surface_above_ring_instances(num_instances)
        mushroom_data['stalk-surface-below-ring'] = self.generate_stalk_surface_below_ring_instances(num_instances)
        mushroom_data['stalk-color-above-ring'] = self.generate_stalk_color_above_ring_instances(num_instances)
        mushroom_data['stalk-color-below-ring'] = self.generate_stalk_color_below_ring_instances(num_instances)
        mushroom_data['veil-type'] = self.generate_veil_type_instances(num_instances)
        mushroom_data['veil-color'] = self.generate_veil_color_instances(num_instances)
        mushroom_data['ring-number'] = self.generate_ring_number_instances(num_instances)
        mushroom_data['ring-type'] = self.generate_ring_type_instances(num_instances)
        mushroom_data['spore-print-color'] = self.generate_spore_print_color_instances(num_instances)
        mushroom_data['population'] = self.generate_population_instances(num_instances)
        mushroom_data['habitat'] = self.generate_habitat_instances(num_instances)
        mushroom_data['poisonous'] = self.generate_poisonous_instances(num_instances)

        return mushroom_data
    
    def populate_synthetic_data_table(self, data):
        
        self.synthetic_data = data
    
        conn = mysql.connector.connect(host='localhost', user='root', password='ABCDabcd1234$', database='mushroom_classification_project')
        cursor = conn.cursor()

        for index, row in self.synthetic_data.iterrows():
            cursor.execute("INSERT INTO synthetic_data (cap_shape, cap_surface, cap_color, bruises, odor, gill_attachment, "
                       "gill_spacing, gill_size, gill_color, stalk_shape, stalk_root, stalk_surface_above_ring, "
                       "stalk_surface_below_ring, stalk_color_above_ring, stalk_color_below_ring, veil_type, veil_color, "
                       "ring_number, ring_type, spore_print_color, population, habitat, target) "
                       "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)",
                       tuple(row))

        conn.commit()
        cursor.close()
        conn.close()

num_instances = 2000
syntheticData_obj = syntheticDataGenerator(num_instances)
mushroom_data = syntheticData_obj.generate_mushroom_data()
syntheticData_obj.populate_synthetic_data_table(mushroom_data)

In [7]:
mushroom_data.count()

cap-shape                   2000
cap-surface                 2000
cap-color                   2000
bruises                     2000
odor                        2000
gill-attachment             2000
gill-spacing                2000
gill-size                   2000
gill-color                  2000
stalk-shape                 2000
stalk-root                  2000
stalk-surface-above-ring    2000
stalk-surface-below-ring    2000
stalk-color-above-ring      2000
stalk-color-below-ring      2000
veil-type                   2000
veil-color                  2000
ring-number                 2000
ring-type                   2000
spore-print-color           2000
population                  2000
habitat                     2000
poisonous                   2000
dtype: int64

## Data Shaping Task 2 - Database Schema & Explanation

![BigData-2.JPG](attachment:BigData-2.JPG)

### Features Table:

- Features_id (Primary Key): Unique identifier for each feature.
- Feature_Name: Name of the feature.

### Feature_Values Table:

- Feature_Value_id (Primary Key): Unique identifier for each feature value.
- Features_id (Foreign Key): References the Features table.
- Target_id (Foreign Key): References the Target table.
- Feature_category_value: The specific value for each feature.

### Target Table:

- Target_id (Primary Key): Unique identifier for each target value.
- Target_name: Name of the target (e.g., "poisonous" or "edible").
- Target_value: The specific value for each target.

### Constraints and Relationships:

#### Primary and Foreign Keys:

- Features_id in the Feature_Values table is a foreign key referencing Features_id in the Features table.
- Target_id in the Feature_Values table is a foreign key referencing Target_id in the Target table.

#### Relationships:

- One-to-Many Relationship between Features and Feature_Values: Each feature in the Features table can have multiple values associated with it in the Feature_Values table.
- One-to-Many Relationship between Target and Feature_Values: Each target in the Target table can be associated with multiple feature values in the Feature_Values table.

### Why these groupings:

Normalization: The schema follows the principles of database normalization, with the Feature_Values table acting as a junction table to avoid data redundancy.

Flexibility: This design allows for flexibility in handling features and their values independently from targets, and vice versa.

The schema is structured to accommodate scenarios where features might have multiple values associated with different targets and where targets can have multiple associated feature values. This structure supports a more modular and scalable database design.
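
The three tables and their foreign keys can be sketched as follows. This uses an in-memory SQLite database purely as a stand-in so the example runs without a server; the project itself targets MySQL, and the column types, `NOT NULL` constraints, and sample rows are assumptions that the schema description does not specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Column types are assumed; the names follow the schema description.
conn.executescript("""
CREATE TABLE Features (
    Features_id  INTEGER PRIMARY KEY,
    Feature_Name TEXT NOT NULL
);
CREATE TABLE Target (
    Target_id    INTEGER PRIMARY KEY,
    Target_name  TEXT NOT NULL,
    Target_value TEXT NOT NULL
);
CREATE TABLE Feature_Values (
    Feature_Value_id       INTEGER PRIMARY KEY,
    Features_id            INTEGER NOT NULL REFERENCES Features(Features_id),
    Target_id              INTEGER NOT NULL REFERENCES Target(Target_id),
    Feature_category_value TEXT NOT NULL
);
""")

# Illustrative rows: the feature 'odor' with value 'f' linked to the poisonous target.
conn.execute("INSERT INTO Features VALUES (1, 'odor')")
conn.execute("INSERT INTO Target VALUES (1, 'poisonous', 'p')")
conn.execute("INSERT INTO Feature_Values VALUES (1, 1, 1, 'f')")

# Join the junction table back to both parent tables.
rows = conn.execute("""
    SELECT f.Feature_Name, v.Feature_category_value, t.Target_value
    FROM Feature_Values AS v
    JOIN Features AS f ON f.Features_id = v.Features_id
    JOIN Target   AS t ON t.Target_id   = v.Target_id
""").fetchall()
print(rows)
```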

## Data Sampling Task 1 - Multipurpose Function

In [20]:
class MushroomDataset:
    def __init__(self, data):
        self.data = data

    def purposive_sampling(self, criteria_column, criteria_values, num_instances):
        sample = self.data[self.data[criteria_column].isin(criteria_values)].head(num_instances)
        return sample

    def stratified_sampling(self, feature, fraction, num_instances):
        strata_samples = []
        for value in self.data[feature].unique():
            stratum = self.data[self.data[feature] == value].sample(frac=fraction).head(num_instances)
            strata_samples.append(stratum)
        return strata_samples

    def convenience_sampling(self, feature, num_instances):
        convenience_sample = self.data[self.data[feature] == 'n'].head(num_instances)
        return convenience_sample

    def simple_random_sampling(self, num_instances):
        total_records = len(self.data)
        random_indices = np.random.choice(total_records, size=min(num_instances, total_records), replace=False)
        random_sample = self.data.iloc[random_indices]
        return random_sample
    
    def systematic_sampling(self, criterion, seq=0, num_instances=None):
        sorted_data = self.data.sort_values(by=criterion)
        # The slice step is 2 when seq == 0 (every second sorted row); any other
        # seq gives a step of 1, which keeps every row
        selected_indices = sorted_data.index[::2 if seq == 0 else 1]
        systematic_sample = self.data.loc[selected_indices].head(num_instances) if num_instances else self.data.loc[selected_indices]
        return systematic_sample

    def generate_samples(self, num_instances, criteria):
        if criteria == 'purposive':
            return self.purposive_sampling('odor', ['n', 'f', 'y', 's'], num_instances)
        elif criteria == 'convenience':
            return self.convenience_sampling('gill-size', num_instances)
        elif criteria == 'simple_random':
            return self.simple_random_sampling(num_instances)
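        elif criteria == 'systematic':
            return self.systematic_sampling('cap-shape', seq=0, num_instances=num_instances)
        else:
            # Fail loudly instead of silently returning None.
            raise ValueError(f"Unknown sampling criteria: {criteria!r}")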

mushroom_dataset = MushroomDataset(df)

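The `stratified_sampling` method loops over the strata by hand. For comparison, here is a minimal sketch of proportional stratified sampling built on pandas' `groupby(...).sample(...)`. The function name `proportional_stratified_sample`, the `seed` parameter, and the toy data are assumptions made for illustration and are not part of the project code.

```python
import pandas as pd

def proportional_stratified_sample(data, feature, fraction, seed=0):
    """Draw the same fraction of rows from every stratum of `feature`."""
    return data.groupby(feature).sample(frac=fraction, random_state=seed)

# Toy data standing in for the mushroom DataFrame.
toy = pd.DataFrame({
    "class": ["e"] * 6 + ["p"] * 4,
    "odor": list("nnnnaaffff"),
})

stratified = proportional_stratified_sample(toy, "class", fraction=0.5)
# Each class keeps half of its rows, so the 6:4 ratio is preserved as 3:2.
print(stratified["class"].value_counts().to_dict())
```

Because every stratum is sampled at the same rate, the class balance of the sample mirrors that of the full data, which the other sampling methods do not guarantee.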
## Data Sampling Task 2/1 

In [21]:
class Sample1_purposivesampling:
    def __init__(self, data):
        self.data = data

    def descriptive_statistics(self, df_sample):
        stats = {}
        for column in df_sample.columns:
            count = df_sample[column].count()
            unique = df_sample[column].nunique()
            top = df_sample[column].mode().iloc[0]
            freq = df_sample[column].value_counts().iloc[0]
            stats[column] = {'Count': count, 'Unique Values': unique, 'Mode': top, 'Mode_Freq': freq}

        stats_df = pd.DataFrame(stats).T
        stats_df.index.name = 'Feature'
        return stats_df

    def purposive_sampling(self):
        sample = mushroom_dataset.generate_samples(300, 'purposive')
        sample.to_csv('DataSample1.csv', index=False)
        return sample

    def display_statistics(self, criteria_column, criteria_values):
        sample = self.purposive_sampling()
        stats = self.descriptive_statistics(sample)
        print(f"Descriptive Statistics for Sample based on '{criteria_column}':")
        return stats


statistics_calculator = Sample1_purposivesampling(df)
stats_odor = statistics_calculator.display_statistics('odor', ['n', 'f', 'y', 's'])
stats_odor

Descriptive Statistics for Sample based on 'odor':


Feature,Count,Unique Values,Mode,Mode_Freq
cap-shape,300,3,x,153
cap-surface,300,3,f,219
cap-color,300,4,g,124
bruises,300,2,f,269
odor,300,1,n,300
gill-attachment,300,1,f,300
gill-spacing,300,2,w,180
gill-size,300,2,b,211
gill-color,300,7,n,83
stalk-shape,300,2,t,211

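The `descriptive_statistics` loop reports the count, the number of unique values, the mode, and the mode frequency for each column. For categorical columns, pandas' built-in `describe` reports the same four quantities (as `count`, `unique`, `top`, and `freq`), so a sketch like the following could serve as a cross-check. The toy rows are invented for illustration.

```python
import pandas as pd

# Toy rows standing in for a sample such as DataSample1.csv.
toy_sample = pd.DataFrame({
    "odor": ["n", "n", "n", "f"],
    "gill-size": ["b", "n", "b", "b"],
})

# One row per feature, with columns count, unique, top (mode) and freq (mode frequency).
summary = toy_sample.describe(include="all").T
summary.index.name = "Feature"
print(summary)
```

When several values tie for the most frequent, `top` may differ from `mode().iloc[0]`, so the two approaches should be compared on columns with a clear mode.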

## Data Sampling Task 2/2 

In [22]:
class ConvenienceSampler:
    def __init__(self, data):
        self.data = data
        
    def descriptive_statistics(self, sample):
        stats = {}
        for column in sample.columns:
            count = sample[column].count()
            unique = sample[column].nunique()
            top = sample[column].mode().iloc[0]
            freq = sample[column].value_counts().iloc[0]
            stats[column] = {'Count': count, 'Unique Values': unique, 'Mode': top, 'Mode_Freq': freq}

        stats_df = pd.DataFrame(stats).T
        stats_df.index.name = 'Feature'
        return stats_df

    def convenience_sampling(self):
        convenience_sample = mushroom_dataset.generate_samples(300, 'convenience')
        convenience_sample.to_csv('DataSample2.csv', index=False)
        return convenience_sample

    def display_statistics(self, stats_df):
        print("Descriptive Statistics for Sample based on 'Gill-Size' and Feature Value n:")
        return stats_df

        
convenience_sampler = ConvenienceSampler(df)
convenience_sample = convenience_sampler.convenience_sampling()
stats_convenience = convenience_sampler.descriptive_statistics(convenience_sample)
convenience_sampler.display_statistics(stats_convenience)

Descriptive Statistics for Sample based on 'Gill-Size' and Feature Value n:


Feature,Count,Unique Values,Mode,Mode_Freq
cap-shape,300,3,x,170
cap-surface,300,3,f,133
cap-color,300,4,n,114
bruises,300,2,t,211
odor,300,4,p,123
gill-attachment,300,1,f,300
gill-spacing,300,2,c,212
gill-size,300,1,n,300
gill-color,300,5,n,82
stalk-shape,300,2,e,212


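Taking the first 300 matching rows makes the convenience sample depend on the order of the file. One way to gauge how far such a sample drifts from the full dataset is to compare category shares. The helper name `proportion_gap` and the toy data below are assumptions for illustration only.

```python
import pandas as pd

def proportion_gap(population, sample, column):
    """Sample share minus population share for each category of `column`."""
    pop_share = population[column].value_counts(normalize=True)
    sample_share = sample[column].value_counts(normalize=True)
    # Categories that never appear in the sample get a share of 0.
    sample_share = sample_share.reindex(pop_share.index, fill_value=0.0)
    return (sample_share - pop_share).sort_index()

toy_population = pd.DataFrame({"bruises": ["t", "f", "f", "f", "t", "f", "f", "t"]})
toy_head_sample = toy_population.head(4)  # convenience-style: first rows only

gap = proportion_gap(toy_population, toy_head_sample, "bruises")
print(gap)
```

Positive values mark categories that are over-represented in the sample, negative values those that are under-represented.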
## Data Sampling Task 2/3

In [23]:
class SimpleRandomSampler:
    def __init__(self, data):
        self.data = data

    def simple_random_sampling(self):
        random_sample =  mushroom_dataset.generate_samples(300, 'simple_random')
        random_sample.to_csv('DataSample3.csv', index=False)
        return random_sample

    def descriptive_statistics(self, df_sample):
        stats = {}
        for column in df_sample.columns:
            count = df_sample[column].count()
            unique = df_sample[column].nunique()
            top = df_sample[column].mode().iloc[0]
            freq = df_sample[column].value_counts().iloc[0]
            stats[column] = {'Count': count, 'Unique Values': unique, 'Mode': top, 'Mode_Freq': freq}

        stats_df = pd.DataFrame(stats).T
        stats_df.index.name = 'Feature'
        return stats_df

    def display_statistics(self, stats_df):
        print("Descriptive Statistics for Sample based on Random Sampling:")
        return stats_df

simple_random_sampler = SimpleRandomSampler(df)
random_sample = simple_random_sampler.simple_random_sampling()
stats_random = simple_random_sampler.descriptive_statistics(random_sample)
simple_random_sampler.display_statistics(stats_random)

Descriptive Statistics for Sample based on Random Sampling:


Feature,Count,Unique Values,Mode,Mode_Freq
cap-shape,300,5,x,135
cap-surface,300,3,y,115
cap-color,300,9,n,85
bruises,300,2,f,177
odor,300,9,n,129
gill-attachment,300,2,f,294
gill-spacing,300,2,c,260
gill-size,300,2,b,202
gill-color,300,11,b,67
stalk-shape,300,2,t,178


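`simple_random_sampling` draws indices with `np.random.choice` and does not fix a seed, so the statistics above will change from run to run unless a seed is set elsewhere in the notebook. A minimal sketch of a repeatable variant using `DataFrame.sample` with `random_state` follows; the function name `seeded_random_sample`, the default seed, and the toy data are assumptions for illustration.

```python
import pandas as pd

def seeded_random_sample(data, num_instances, seed=42):
    """Simple random sample without replacement, repeatable for a fixed seed."""
    n = min(num_instances, len(data))
    return data.sample(n=n, random_state=seed)

toy = pd.DataFrame({"cap-shape": list("xbfxkxsbfx")})

first = seeded_random_sample(toy, 4)
second = seeded_random_sample(toy, 4)
# The same seed selects the same rows on every call.
print(first.equals(second))
```

Recording the seed alongside `DataSample3.csv` would let a reader regenerate the exact sample.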
## Data Sampling Task 2/4

In [24]:
class SystematicSampling:

    def __init__(self, data):
        self.data = data

    def descriptive_statistics(self, df_sample):
        stats = {}
        for column in df_sample.columns:
            count = df_sample[column].count()
            unique = df_sample[column].nunique()
            top = df_sample[column].mode().iloc[0]
            freq = df_sample[column].value_counts().iloc[0]
            stats[column] = {'Count': count, 'Unique Values': unique, 'Mode': top, 'Mode_Freq': freq}

        stats_df = pd.DataFrame(stats).T
        stats_df.index.name = 'Feature'
        return stats_df

    def systematic_sampling(self):
        systematic_sample = mushroom_dataset.generate_samples(300, 'systematic')
        systematic_sample.to_csv('DataSample4.csv', index=False)
        return systematic_sample

    def display_systematic_sample_statistics(self, criterion, seq=0):
        systematic_sample = self.systematic_sampling()
        stats_systematic = self.descriptive_statistics(systematic_sample)
        print(f"\nDescriptive Statistics for Systematic Sample (Criterion: {criterion}, Seq: {seq}):")
        return stats_systematic

Systematic_sampler = SystematicSampling(df)
Systematic_sampler.display_systematic_sample_statistics('cap-shape', seq=0)


Descriptive Statistics for Systematic Sample (Criterion: cap-shape, Seq: 0):


Feature,Count,Unique Values,Mode,Mode_Freq
cap-shape,300,3,b,226
cap-surface,300,4,s,128
cap-color,300,7,w,95
bruises,300,2,t,195
odor,300,6,n,154
gill-attachment,300,2,f,275
gill-spacing,300,2,c,246
gill-size,300,2,b,274
gill-color,300,11,w,77
stalk-shape,300,2,e,232

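The systematic sample above keeps every second row of the sorted data and then truncates to 300 rows, so it only reaches the first cap shapes in sort order (the table shows 3 distinct `cap-shape` values). A common alternative spreads the sample across the whole dataset by using an interval of k = N // n with a random start. The following is a minimal sketch of that idea; the function name `interval_systematic_sample`, the `seed` parameter, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def interval_systematic_sample(data, num_instances, seed=0):
    """Take every k-th row after a random start, with k = len(data) // num_instances."""
    if num_instances <= 0 or num_instances > len(data):
        raise ValueError("num_instances must be between 1 and len(data)")
    k = len(data) // num_instances
    # Random starting offset in [0, k).
    start = int(np.random.default_rng(seed).integers(k))
    return data.iloc[start::k].head(num_instances)

toy = pd.DataFrame({"row": range(100)})
sample = interval_systematic_sample(toy, 10)
print(sample["row"].tolist())
```

To combine this with a sort criterion, the data could be sorted first (for example with `sort_values`) and then passed to the function.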