<a href="https://colab.research.google.com/github/gitmystuff/DTSC4050/blob/main/Week_08-Feature_Engineering/Week_08_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 08 - Assignment

Your Name

## Getting Started

* Colab: get the notebook from the gitmystuff DTSC4050 repository
* Save a copy in Drive
* Remove "Copy of" from the notebook name
* Edit name
* Take attendance
* Clean up Colab Notebooks folder
* Submit shared link


**Assignment Instructions by Section:**

1.  **Data Exploration and Correlation (20%):**
    * **Task:**
        * Load the dataset and perform initial data exploration (e.g., describe, info).
        * Identify numerical features and calculate the correlation matrix.
        * Visualize the correlation matrix using a heatmap.
        * Identify highly correlated features and discuss potential multicollinearity issues.
    * **Learning Objectives:**
        * Understand correlation and its implications.
        * Use Python libraries (e.g., pandas, seaborn) for data exploration and visualization.

2.  **Derived Variables (20%):**
    * **Task:**
        * Based on domain knowledge or observed patterns, create at least two meaningful derived variables.
        * Explain the rationale behind creating each derived variable.
        * Analyze the impact of the derived variables on the correlation matrix.
        * Discuss if any of the derived variables help to reduce multicollinearity.
    * **Learning Objectives:**
        * Apply domain knowledge to create useful features.
        * Understand how derived variables can impact data relationships.
        * Understand how derived variables can impact multicollinearity.

3.  **Missing Value Imputation (15%):**
    * **Task:**
        * Identify features with missing values.
        * Choose appropriate imputation methods for each feature (e.g., mean, median, mode, KNN imputation).
        * Justify the choice of imputation method for each feature.
        * Evaluate the impact of imputation on the data distribution.
    * **Learning Objectives:**
        * Handle missing data effectively.
        * Apply different imputation techniques.
        * Understand the impact of imputation on data.

4.  **Categorical Encoding (15%):**
    * **Task:**
        * Identify categorical features (nominal and ordinal).
        * Apply appropriate encoding techniques (e.g., one-hot encoding, ordinal encoding, frequency encoding).
        * Explain the rationale for choosing each encoding method.
        * Discuss the potential impact of encoding on model performance.
    * **Learning Objectives:**
        * Encode categorical features appropriately.
        * Understand the differences between encoding methods.
        * Understand the impact of high cardinality.

5.  **Outlier Handling (15%):**
    * **Task:**
        * Identify potential outliers in numerical features using visualization or statistical methods (e.g., box plots, Z-scores).
        * Choose and apply appropriate outlier handling techniques (e.g., removal, capping, transformation).
        * Justify the chosen methods and discuss their impact.
    * **Learning Objectives:**
        * Detect and handle outliers.
        * Understand the impact of outliers on data.

6.  **Feature Scaling (15%):**
    * **Task:**
        * Apply appropriate scaling techniques to numerical features (e.g., standardization, min-max scaling).
        * Explain the rationale for choosing each scaling method.
        * Discuss the importance of scaling for different machine learning algorithms.
    * **Learning Objectives:**
        * Scale numerical features effectively.
        * Understand the differences between scaling methods.
        * Understand when scaling is necessary.


In [None]:
# set seed
import time
import numpy as np
import random

def generate_user_seed():
    # Get current time in nanoseconds (more granular)
    nanoseconds = time.time_ns()

    # Add a small random component to further reduce collision chances
    random_component = random.randint(0, 1000)  # Adjust range as needed

    # Combine them (XOR is a good way to mix values)
    seed = nanoseconds ^ random_component

    # Ensure the seed is within the valid range for numpy's seed
    seed = seed % (2**32)  # Modulo to keep it within 32-bit range

    return seed

user_seed = generate_user_seed()
print(user_seed)
np.random.seed(user_seed)

372378630


In [None]:
import pandas as pd
import numpy as np
# from sklearn.datasets import make_regression, make_classification

# Number of samples
n_samples = 1000

# Numerical Features
numerical_data = {
    'squirrel_sightings': np.random.normal(0, 1, n_samples),
    'cat_nap_tilt': np.random.exponential(1, n_samples),
    'cat_meow': np.random.normal(0, 1, n_samples),  # overwritten below with correlated data
    'minutes_to_eat': None,  # placeholder, filled below with correlated data
    'turtle_blinks': np.random.normal(0, 1, n_samples),
    'cat_disposition': np.random.normal(0, 1, n_samples)
}

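# Generate two correlated features (unit variances, correlation 0.8);
# this replaces the cat_meow values drawn above
correlated_data = np.random.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], n_samples)
numerical_data['cat_meow'] = correlated_data[:, 0]
numerical_data['minutes_to_eat'] = correlated_data[:, 1]

# Add outliers: shift up to 20 values by +5
# (np.random.choice samples with replacement, so a repeated index is shifted only once)
numerical_data['turtle_blinks'][np.random.choice(n_samples, 20)] += 5

# Add missing values: up to 100 NaNs
# (sampling with replacement, so the count is usually slightly below 100)
missing_indices = np.random.choice(n_samples, 100)
numerical_data['cat_disposition'][missing_indices] = np.nan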

# Categorical Features
categorical_data = {
    'cat_personality': np.random.choice(['Annoying', 'Sweet', 'Wise', 'Playful'], n_samples),
    'meow_volume': np.random.choice(['Low', 'Medium', 'High'], n_samples, p=[0.2,0.5,0.3]),
    'mystery_stain_colors': np.random.choice(['Red','Blue','Yellow', None], n_samples)
}

# Create DataFrame
df_numerical = pd.DataFrame(numerical_data)
df_categorical = pd.DataFrame(categorical_data)
df = pd.concat([df_numerical, df_categorical], axis=1)

df['happiness'] = df['squirrel_sightings'] + 0.5 * df['cat_meow'] + np.random.normal(0, 0.5, n_samples)

# Save to CSV
df.to_csv('assignment_8.csv', index=False)

In [None]:
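import pandas as pd

df = pd.read_csv('assignment_8.csv')
print(df.shape)
print(df.info())  # info() prints its summary and returns None, hence the trailing "None"
df.head()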

(1000, 10)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   squirrel_sightings    1000 non-null   float64
 1   cat_nap_tilt          1000 non-null   float64
 2   cat_meow              1000 non-null   float64
 3   minutes_to_eat        1000 non-null   float64
 4   turtle_blinks         1000 non-null   float64
 5   cat_disposition       908 non-null    float64
 6   cat_personality       1000 non-null   object 
 7   meow_volume           1000 non-null   object 
 8   mystery_stain_colors  739 non-null    object 
 9   happiness             1000 non-null   float64
dtypes: float64(7), object(3)
memory usage: 78.3+ KB
None


| | squirrel_sightings | cat_nap_tilt | cat_meow | minutes_to_eat | turtle_blinks | cat_disposition | cat_personality | meow_volume | mystery_stain_colors | happiness |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.533726 | 1.529659 | 0.865465 | 0.556167 | 0.073019 | -0.666422 | Sweet | Medium | Red | -0.04107 |
| 1 | 0.156814 | 0.025311 | 1.163224 | 0.464324 | 0.257734 | 0.596533 | Wise | Medium | NaN | 1.005483 |
| 2 | 0.543115 | 0.543681 | 0.630995 | 0.309871 | 0.932513 | 0.955078 | Annoying | Low | Blue | 1.027032 |
| 3 | -0.204599 | 0.205113 | -1.84764 | -2.284689 | -1.35956 | 0.935399 | Sweet | Low | Yellow | -0.776038 |
| 4 | 0.639063 | 1.682475 | 0.689145 | 1.131473 | 0.242905 | -1.114813 | Wise | Low | Yellow | 1.360179 |


1.  **Data Exploration and Correlation (20%):**
    * **Task:**
        * Perform initial data exploration (e.g., shape, info, describe).
        * Identify numerical features and calculate the correlation matrix.
        * Visualize the correlation matrix using a heatmap (caution).
        * Identify highly correlated features and discuss potential multicollinearity issues.
        * Document what you are doing (using comments is ok).
    * **Learning Objectives:**
        * Understand correlation and its implications.
        * Use Python libraries (e.g., pandas, seaborn) for data exploration and visualization.
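
One possible way to approach the correlation step is sketched below. It is only an illustration: it builds a small stand-in frame instead of reading `assignment_8.csv`, and both the helper `high_corr_pairs` and the 0.7 threshold are assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): two correlated columns, one independent
# numerical column, and one categorical column.
rng = np.random.default_rng(0)
pair = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], 1000)
demo = pd.DataFrame({
    "cat_meow": pair[:, 0],
    "minutes_to_eat": pair[:, 1],
    "turtle_blinks": rng.normal(0, 1, 1000),
    "meow_volume": rng.choice(["Low", "Medium", "High"], 1000),
})

# Keep only numerical columns, then compute pairwise Pearson correlations.
numeric = demo.select_dtypes(include="number")
corr = numeric.corr()


def high_corr_pairs(corr_matrix, threshold=0.7):
    """Hypothetical helper: list (col_a, col_b, r) for pairs with |r| >= threshold."""
    cols = corr_matrix.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            r = corr_matrix.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


flagged = high_corr_pairs(corr)
print(corr.round(2))
print(flagged)
```

Passing `corr` to `seaborn.heatmap(corr, annot=True, vmin=-1, vmax=1)` would draw the heatmap; selecting numerical columns first avoids errors from string columns.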

2.  **Derived Variables (20%):**
    * **Task:**
        * Based on domain knowledge or observed patterns, create at least two meaningful derived variables.
        * Explain the rationale behind creating each derived variable.
        * Analyze the impact of the derived variables on the correlation matrix.
        * Discuss if any of the derived variables help to reduce multicollinearity.
    * **Learning Objectives:**
        * Apply domain knowledge to create useful features.
        * Understand how derived variables can impact data relationships.
        * Understand how derived variables can impact multicollinearity.
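
As a rough illustration of how derived variables can change the correlation structure, the sketch below replaces two correlated columns with their mean and their difference. The stand-in data and the names `meow_meal_mean` and `meow_meal_gap` are hypothetical; your own derived variables should follow from your reading of the data.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): two features with unit variance and correlation 0.8.
rng = np.random.default_rng(0)
pair = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], 1000)
demo = pd.DataFrame({"cat_meow": pair[:, 0], "minutes_to_eat": pair[:, 1]})

# Hypothetical derived variables: the shared level and the gap between the two.
derived = pd.DataFrame({
    "meow_meal_mean": (demo["cat_meow"] + demo["minutes_to_eat"]) / 2,
    "meow_meal_gap": demo["cat_meow"] - demo["minutes_to_eat"],
})

# Compare the correlation before and after the transformation.
r_original = demo["cat_meow"].corr(demo["minutes_to_eat"])
r_derived = derived["meow_meal_mean"].corr(derived["meow_meal_gap"])
print(f"original pair:  r = {r_original:.2f}")
print(f"derived pair:   r = {r_derived:.2f}")
```

The mean and the gap come out nearly uncorrelated here because the two source columns have equal variance. Note that keeping the original columns alongside these derived ones would add collinearity rather than remove it, since each derived column is a linear combination of the originals.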

3.  **Missing Value Imputation (15%):**
    * **Task:**
        * Identify features with missing values.
        * Choose appropriate imputation methods for each feature (e.g., mean, median, mode, KNN imputation).
        * Justify the choice of imputation method for each feature.
        * Evaluate the impact of imputation on the data distribution.
    * **Learning Objectives:**
        * Handle missing data effectively.
        * Apply different imputation techniques.
        * Understand the impact of imputation on data.
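
A minimal sketch of simple imputation follows, assuming median imputation for a numerical column and mode imputation for a categorical one. The stand-in frame and the missing-value counts are made up for illustration; the right method for each feature is for you to justify.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): one numerical and one categorical column with gaps.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "cat_disposition": rng.normal(0, 1, 500),
    "mystery_stain_colors": rng.choice(["Red", "Blue", "Yellow"], 500),
})
demo.loc[rng.choice(500, 50, replace=False), "cat_disposition"] = np.nan
demo.loc[rng.choice(500, 100, replace=False), "mystery_stain_colors"] = np.nan

# Count missing values per column before imputing.
missing_before = demo.isna().sum()

imputed = demo.copy()
# Numerical column: fill with the median, which is robust to outliers.
imputed["cat_disposition"] = demo["cat_disposition"].fillna(demo["cat_disposition"].median())
# Categorical column: fill with the most frequent category (the mode).
most_common = demo["mystery_stain_colors"].mode().iloc[0]
imputed["mystery_stain_colors"] = demo["mystery_stain_colors"].fillna(most_common)

# Compare the spread before and after to see the effect on the distribution.
std_before = demo["cat_disposition"].std()
std_after = imputed["cat_disposition"].std()
print(missing_before)
print(f"std before: {std_before:.3f}, std after: {std_after:.3f}")
```

Filling every gap with a single value shrinks the standard deviation, which is one way to evaluate the impact of imputation. For the categorical column, filling with a separate label such as "Unknown" is an alternative when missingness itself may be informative.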

4.  **Categorical Encoding (15%):**
    * **Task:**
        * Identify categorical features (nominal and ordinal).
        * Apply appropriate encoding techniques (e.g., one-hot encoding, ordinal encoding).
        * Explain the rationale for choosing each encoding method.
        * Discuss the potential impact of encoding on model performance.
    * **Learning Objectives:**
        * Encode categorical features appropriately.
        * Understand the differences between encoding methods.
        * Understand the impact of high cardinality.
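
The sketch below shows one way the two encodings might look in pandas. It assumes `meow_volume` is treated as ordinal (Low < Medium < High) and `cat_personality` as nominal; the five-row frame and the new column names are illustrative only.

```python
import pandas as pd

# Stand-in data (assumption): a few rows with the two kinds of categories.
demo = pd.DataFrame({
    "cat_personality": ["Annoying", "Sweet", "Wise", "Playful", "Sweet"],
    "meow_volume": ["Low", "High", "Medium", "Low", "High"],
})

# Ordinal encoding: state the order explicitly so Low=0, Medium=1, High=2.
volume_order = ["Low", "Medium", "High"]
volume_codes = pd.Categorical(
    demo["meow_volume"], categories=volume_order, ordered=True
).codes

# One-hot encoding: one 0/1 column per category, with no implied order.
one_hot = pd.get_dummies(demo["cat_personality"], prefix="personality", dtype=int)

encoded = pd.concat(
    [pd.DataFrame({"meow_volume_ord": volume_codes}), one_hot], axis=1
)
print(encoded)
```

One-hot encoding adds one column per category, so a feature with many distinct values (high cardinality) can widen the frame considerably. Passing `drop_first=True` to `pd.get_dummies` drops one redundant column, which matters for linear models.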

5.  **Outlier Handling (15%):**
    * **Task:**
        * Identify potential outliers in numerical features using visualization or statistical methods (e.g., box plots, Z-scores).
        * Choose and apply appropriate outlier handling techniques (e.g., removal, capping, transformation).
        * Justify the chosen methods and discuss their impact.
    * **Learning Objectives:**
        * Detect and handle outliers.
        * Understand the impact of outliers on data.
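
As a hedged example of detection and capping, the sketch below flags outliers with Z-scores and with the 1.5 × IQR rule, then caps values at the IQR fences. The stand-in series, the Z-score cutoff of 3, and the choice of capping over removal are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): standard normal values with 20 shifted by +5.
rng = np.random.default_rng(2)
values = rng.normal(0, 1, 1000)
values[:20] += 5
blinks = pd.Series(values, name="turtle_blinks")

# Z-score method: flag points more than 3 standard deviations from the mean.
z_scores = (blinks - blinks.mean()) / blinks.std()
z_outliers = blinks[z_scores.abs() > 3]

# IQR method: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = blinks.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = blinks[(blinks < lower) | (blinks > upper)]

# Capping: pull extreme values back to the fences instead of dropping rows.
capped = blinks.clip(lower=lower, upper=upper)

print(f"Z-score outliers: {len(z_outliers)}, IQR outliers: {len(iqr_outliers)}")
print(f"max before: {blinks.max():.2f}, max after capping: {capped.max():.2f}")
```

Capping keeps the row count unchanged, which is convenient when other columns in the same rows are still useful. Removal or a transformation may suit other features better.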

6.  **Feature Scaling (15%):**
    * **Task:**
        * Apply appropriate scaling techniques to numerical features (e.g., standardization, min-max scaling).
        * Explain the rationale for choosing each scaling method.
        * Discuss the importance of scaling for different machine learning algorithms.
    * **Learning Objectives:**
        * Scale numerical features effectively.
        * Understand the differences between scaling methods.
        * Understand when scaling is necessary.
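
The two scaling methods named above can be sketched directly in pandas, as below. The stand-in frame is an assumption; `sklearn.preprocessing.StandardScaler` and `MinMaxScaler` compute the same quantities and are the more common choice inside a modeling pipeline.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): one roughly symmetric and one skewed feature.
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "squirrel_sightings": rng.normal(0, 1, 200),
    "cat_nap_tilt": rng.exponential(1, 200),
})

# Standardization: subtract the mean and divide by the standard deviation.
# ddof=0 uses the population standard deviation, as StandardScaler does.
standardized = (demo - demo.mean()) / demo.std(ddof=0)

# Min-max scaling: map each column linearly onto the range [0, 1].
min_max = (demo - demo.min()) / (demo.max() - demo.min())

print(standardized.describe().round(2))
print(min_max.describe().round(2))
```

Distance-based and gradient-based algorithms (for example k-nearest neighbors or regularized linear models) are sensitive to feature scale, while tree-based models generally are not. When a train/test split is involved, the means, standard deviations, minima, and maxima should come from the training data only.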