<a href="https://colab.research.google.com/github/gitmystuff/DTSC4050/blob/main/Week_12-Classification_I/Data_Science_Fiction_II.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Fiction II

## Getting Started

* Colab: open the notebook from the gitmystuff DTSC4050 repository
* Save a copy in Drive
* Remove "Copy of" from the file name
* Edit the name
* Clean up the Colab Notebooks folder
* Submit the shared link

## Instructions

The goal of this assignment is to take messy data, clean it up, and then analyze it using logistic regression.

* Run Parts 1 and 2, paying attention to what each cell does
* Part 3 - clean the data in preparation for modeling
* Part 4 - perform the necessary feature engineering
* Part 5 - select the variables that will be most useful for classification
* Part 6 - model the data and evaluate it; explain concepts when asked

# Part 1 - The Data

## Seed the Project

In [None]:
import time
import numpy as np
import random

def generate_user_seed():
    # Get current time in nanoseconds (more granular)
    nanoseconds = time.time_ns()

    # Add a small random component to further reduce collision chances
    random_component = random.randint(0, 1000)  # Adjust range as needed

    # Combine them (XOR is a good way to mix values)
    seed = nanoseconds ^ random_component

    # Ensure the seed is within the valid range for numpy's seed
    seed = seed % (2**32)  # Modulo to keep it within 32-bit range

    return seed

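user_seed = generate_user_seed()
print(user_seed)
np.random.seed(user_seed)  # np.random.seed() returns None, so its result is not assigned
random_state = user_seed   # integer seed reused later wherever random_state is needed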
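
One detail worth checking when seeding: `np.random.seed` reseeds NumPy's global generator in place and returns `None`, so the integer seed itself is what should be passed to anything that takes a `random_state`. A minimal sketch with a fixed illustrative seed (`demo_seed` is a hypothetical name, not part of the notebook):

```python
import numpy as np

demo_seed = 12345  # illustrative value; the notebook derives its seed from the clock

# np.random.seed() reseeds the global generator and returns None
returned = np.random.seed(demo_seed)
first_draw = np.random.rand(3)

# Reseeding with the same integer reproduces the same draws
np.random.seed(demo_seed)
second_draw = np.random.rand(3)

print(returned)                                 # None
print(np.array_equal(first_draw, second_draw))  # True
```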

## Faker

In [None]:
pip install Faker -q

In [None]:
habitable_planets = [
    "Alpha Centauri III",
    "Eden",
    "Terra Nova",
    "Tiberius",
    "Vega Colony",
    "Cait",
    "Andoria",
    "Vulcanis",
    "Risa",
    "Betazed",
    "Ba'ku",
    "Aldea",
    "Nimbus III",
    "Deneva",
    "Capella IV",
    "Organia",
    "Trillius Prime",
    "Kaelon II",
    "Mintaka III",
    "Rubicun III",
    "Pacifica",
    "Tau Ceti III",
    "Melina",
    "Argelius II",
    "Iconia",
    "Alderaan",
    "Naboo",
    "Bespin (Cloud City)",
    "Yavin IV",
    "Endor (Forest Moon)",
    "Kashyyyk",
    "Mon Cala",
    "Corellia",
    "Chandrila",
    "Ryloth",
    "Cato Neimoidia",
    "Felucia",
    "Saleucami",
    "Stewjon",
    "Iego",
    "Glee Anselm",
    "Mirial",
    "Serenno",
    "Malastare",
    "Dantooine",
    "Haruun Kal",
    "Manaan",
    "Zolan",
    "Ord Mantell",
    "Pantora"
]

In [None]:
# create demographic data
import numpy as np
import pandas as pd
from faker import Faker
fake = Faker()

n = 1000

output = []
for x in range(n):
    biology = np.random.choice(['Cytophore', 'Kymete'], p=[0.5, 0.5])
    output.append({
        'categorical_1': biology,
        'categorical_2': np.random.choice(['Xylosian', 'Veridian', 'CKaeltharr']),
        'name_1': fake.first_name_female() if biology == 'Cytophore' else fake.first_name_male(),
        'name_2': fake.last_name(),
        'code': fake.zipcode(),
        'date': fake.date_of_birth(),
        'location': np.random.choice(habitable_planets)
    })

demographics = pd.DataFrame(output)
print(demographics.shape)
demographics.head()

## Create Independent Variable Correlated with Class

In [None]:
import numpy as np
import pandas as pd

def generate_feature(df, class_col, coeff, intercept):
    """
    Generates normally distributed feature data for a logistic regression model.

    Args:
        df: The pandas DataFrame containing the class column.
        class_col: The name of the class column (containing 0s and 1s).
        coeff: The coefficient for the feature in the logistic regression model.
        intercept: The intercept of the logistic regression model.

    Returns:
        A pandas Series containing the generated feature data.
    """

    # Generate probabilities based on the class
    probs = np.random.rand(len(df))  # Initial random probabilities
    probs = np.where(df[class_col] == 1, probs * 0.8 + 0.2, probs * 0.8)  # Adjust for class

    # Apply the logit (log-odds) function, the inverse of the logistic sigmoid
    logits = np.log(probs / (1 - probs))

    # Calculate the feature values
    feature_values = (logits - intercept) / coeff

    return pd.Series(feature_values)
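
The transformation inside `generate_feature` can be checked in isolation. The sketch below uses made-up probabilities and the same `coeff=0.5`, `intercept=-1` values passed later in the notebook. It shows that the logit maps probabilities to log-odds and that the logistic (sigmoid) function maps the resulting feature values back to the original probabilities. The helper names `logit` and `sigmoid` are illustrative.

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1 - p))

def sigmoid(z):
    """Logistic function, the inverse of logit."""
    return 1 / (1 + np.exp(-z))

coeff, intercept = 0.5, -1.0             # same values used for corr_feature_class
probs = np.array([0.1, 0.25, 0.5, 0.9])  # made-up probabilities for illustration

# Solve logit(p) = intercept + coeff * x for x, as generate_feature does
feature_values = (logit(probs) - intercept) / coeff

# Plugging x back into the logistic model recovers the probabilities
recovered = sigmoid(intercept + coeff * feature_values)
print(np.allclose(recovered, probs))  # True
```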



## Make Classification

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Adjust the make_classification parameters:
# n_informative + n_redundant + n_repeated must not exceed n_features
X, y = make_classification(n_samples=n, n_features=2, n_informative=2, n_redundant=0, n_repeated=0, random_state=42)

# Fit the model once so that apply() below does not refit it for every row
model = LogisticRegression()
model.fit(X, y)

def make_linear_y(row):
  # linear predictor (log-odds) of the fitted model for a single row
  coefficients = model.coef_
  intercept = model.intercept_
  f_of_x = intercept + coefficients[0][0]*row['informative_1'] + coefficients[0][1]*row['informative_2']
  return f_of_x[0]

df = pd.DataFrame(X, columns=['informative_1', 'informative_2'])
df = pd.concat([demographics, df], axis=1).reset_index(drop=True)

df['target'] = df.apply(make_linear_y, axis=1) # an independent variable
df['class'] = y # the dependent variable
df['corr_feature_class'] = generate_feature(df, 'class', 0.5, -1)
df.head()

## Automation Functions

1. gen_null(series, perc)
2. gen_quasi_constants(primary_label, variation_percentage=.2, size=len(df))
3. gen_normal_data(mu=0, std=1, size=len(df))
4. gen_uniform_data(size=len(df))
5. gen_multivariate_normal_data(mean=[0, 0], cov=[[1, 0], [0, 1]], size=len(df))
6. gen_correlated_normal_series(original_series, target_correlation, size=len(df))
7. gen_correlated_uniform_series(original_series, correlation_coefficient=0, size=len(df))
8. gen_outliers(mean=0, std_dev=1, size=len(df), outlier_percentage=0.1, outlier_magnitude=3)
9. gen_standard_scaling(mean=50, std_dev=10, size=len(df), scale_factor=1000)
10. gen_minmax_scaling(mean=50, std_dev=10, size=len(df), range_factor=10)
11. random_choice_data(choices, size)

In [None]:
# functions
import pandas as pd
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize


def gen_null(series, perc):
  """
  Introduces null values (np.nan) into a Series based on a specified percentage.

  Args:
      series: The pandas Series to copy and modify.
      perc: The percentage of values to replace with nulls (0-100).

  Returns:
      A copy of the Series with nulls inserted.
  """
  var = series.copy()
  num_nulls = int(len(var) * (perc / 100))
  indices_to_replace = np.random.choice(len(var), num_nulls, replace=False)

  for idx in indices_to_replace:
      var[idx] = np.nan

  return var

def gen_quasi_constants(primary_label, variation_percentage=.2, size=len(df)):
  """
  Generates quasi-constant labels for a Series, with a small percentage of variation.

  Args:
      primary_label: The main label to use for most values.
      variation_percentage: The percentage of labels to vary (0-100).
      size: The number of labels to generate.

  Returns:
      A new Series containing the quasi-constant labels.
  """

  series = pd.Series(np.full(size, primary_label))
  num_variations = int(size * (variation_percentage / 100))
  variation_indices = np.random.choice(series.index, num_variations, replace=False)
  primary_label = primary_label + '_0'
  variation1 = primary_label + '_1'
  variation2 = primary_label + '_2'

  labels = pd.Series([primary_label] * len(series), index=series.index)
  labels.loc[variation_indices] = np.random.choice([variation1, variation2], size=num_variations)  # Adjust variations as needed

  return labels

def gen_normal_data(mu=0, std=1, size=len(df)):
  """
  Generates a normal dataset given the mean and standard deviation

  Args:
        mu: The mean of the normal distribution.
        std: The standard deviation of the normal distribution.
        size: The number of data points to generate.

  Returns:
        A normally distributed series.
  """
  return np.random.normal(mu, std, size)

def gen_uniform_data(size=len(df)):
  """
  Generates a uniform dataset

  Args:
        size: The number of data points to generate.

  Returns:
        A uniform distributed series.
  """
  return np.random.uniform(size=size)

def gen_multivariate_normal_data(mean=[0, 0], cov=[[1, 0], [0, 1]], size=len(df)):
  """
  Generates two datasets with a multivariate normal distribution given the mean and covariance matrix

  Args:
        mean: The mean of each of the datasets.
        cov: The covariance matrix of the datasets.
        size: The number of data points to generate.

  Returns:
        Two correlated series.
  """
  ds1, ds2 = np.random.multivariate_normal(mean, cov, size, tol=1e-6).T # ds = dataset
  return ds1, ds2

def gen_correlated_normal_series(original_series, target_correlation, size=len(df)):
  """
  Generates a correlated series based on a given series.

  This function takes an original series as input and generates a new series
  that is correlated with the original series. The correlation between the
  original and generated series is approximately equal to the specified
  target correlation.

  The generated series is created by linearly transforming the original series
  and adding Gaussian noise with an adjusted standard deviation to achieve the
  desired correlation.

  Args:
      original_series (numpy.ndarray): The original series.
      target_correlation (float): The desired Pearson correlation coefficient
          between the original and generated series.

  Returns:
      numpy.ndarray: The generated correlated series.
  """
  # Explanation
  #
  # This one-liner leverages the properties of linear transformations and normal distributions to generate a correlated series.
  #
  # It first centers the original_series by subtracting its mean.
  # It then scales this centered series by the target_correlation.
  # Finally, it adds Gaussian noise with a standard deviation adjusted to ensure the overall correlation matches the target_correlation.
  return np.mean(original_series) + target_correlation * (original_series - np.mean(original_series)) \
  +  np.random.normal(0, np.sqrt(1 - target_correlation**2) * np.std(original_series), len(original_series))

def gen_correlated_uniform_series(original_series, correlation_coefficient=0, size=len(df)):
  """
  Work in progress

  Generates a new series correlated with the given series based on the specified correlation coefficient,
  using rank correlation to ensure the generated series follows a uniform distribution.

  Args:
      original_series (numpy.ndarray or list): The original series.
      correlation_coefficient (float): The desired correlation coefficient between the original and generated series.
      size: The number of data points to generate.

  Returns:
      The generated correlated series with a uniform distribution.
  """
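  z_scores = (original_series - np.mean(original_series)) / np.std(original_series)
  # Work in progress: the argument is overridden with a fixed value for now
  correlation_coefficient = .7
  # Gaussian copula: weight the standardized original series by the coefficient,
  # weight independent normal noise by sqrt(1 - coefficient**2), then map to (0, 1)
  noise = norm.ppf(np.random.uniform(size=size))
  return norm.cdf(correlation_coefficient * z_scores + np.sqrt(1 - correlation_coefficient**2) * noise)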

def pearson_r_func(x, y, y_mean, y_std, desired_r):
    x_mean = np.mean(x)
    x_std = np.std(x)
    numerator = np.sum((x - x_mean) * (y - y_mean))
    denominator = x_std * y_std * len(x)
    calculated_r = numerator / denominator
    return (calculated_r - desired_r)**2  # Minimize the squared difference

def minimize_r(original_series, target_correlation, size=len(df)):
    y = original_series
    y_mean = np.mean(y)
    y_std = np.std(y)
    desired_r = target_correlation

    # Initial guess for x values
    x0 = np.random.uniform(size=len(original_series))

    # Solve for x
    result = minimize(pearson_r_func, x0, args=(y, y_mean, y_std, desired_r))

    if result.success:
        x_solution = result.x
        # print("Solution for x:", x_solution)
        return x_solution
    else:
        print("Optimization failed.")

def gen_outliers(mean=0, std_dev=1, size=len(df), outlier_percentage=0.1, outlier_magnitude=3):
    """
    Generates a normal distribution with outliers.

    Args:
        mean (float): The mean of the normal distribution.
        std_dev (float): The standard deviation of the normal distribution.
        size (int): The number of samples to generate.
        outlier_percentage (float): The percentage of outliers to introduce (between 0 and 1).
        outlier_magnitude (float): The magnitude by which outliers deviate from the mean.

    Returns:
        numpy.ndarray: The generated data with outliers.
    """
    data = np.random.normal(mean, std_dev, size)
    num_outliers = int(size * outlier_percentage)
    outlier_indices = np.random.choice(size, num_outliers, replace=False)
    for index in outlier_indices:
        if np.random.rand() < 0.5:
            data[index] += outlier_magnitude
        else:
            data[index] -= outlier_magnitude

    return data

def gen_standard_scaling(mean=50, std_dev=10, size=len(df), scale_factor=1000):
  """
  Generates data with a specified mean and standard deviation, then scales it by a factor to create a distribution needing scaling.

  Args:
      mean (float): The mean of the original distribution.
      std_dev (float): The standard deviation of the original distribution.
      size (int): The number of samples to generate.
      scale_factor (float): The factor by which to scale the original distribution.

  Returns:
      numpy.ndarray: The generated data needing scaling.
  """
  original_data = np.random.normal(mean, std_dev, size)
  return original_data * scale_factor

def gen_minmax_scaling(mean=50, std_dev=10, size=len(df), range_factor=10):
  """
  Generates data with a specified mean and standard deviation, then scales and shifts it to create a distribution needing MinMax scaling.

  Args:
      mean (float): The mean of the original distribution.
      std_dev (float): The standard deviation of the original distribution.
      size (int): The number of samples to generate.
      range_factor (float): The factor to expand the range of the original distribution.

  Returns:
      numpy.ndarray: The generated data needing scaling.
  """

  # Generate the original data
  original_data = np.random.normal(mean, std_dev, size)

  # Expand the range of the data
  min_val = np.min(original_data)
  max_val = np.max(original_data)
  return (original_data - min_val) * range_factor + min_val

def random_choice_data(choices, size):
  """
  Randomly samples labels from a list of choices.

  Args:
      choices (list): The labels to sample from, each with equal probability.
      size (int): The number of values to generate.

  Returns:
      numpy.ndarray: The randomly chosen labels.
  """
  return np.random.choice(choices, size=size)
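
The construction used in `gen_correlated_normal_series` can be sanity-checked empirically. The sketch below re-implements the same formula under a hypothetical name (`correlated_copy`) on synthetic data and compares the sample Pearson correlation with the requested one. With this many points the two should agree to within sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the check is repeatable

def correlated_copy(x, r, rng):
    """Scale the centered series by r and add noise with std sqrt(1 - r^2) * std(x)."""
    centered = x - np.mean(x)
    noise = rng.normal(0, np.sqrt(1 - r**2) * np.std(x), len(x))
    return np.mean(x) + r * centered + noise

x = rng.normal(10, 3, 20_000)  # synthetic stand-in for df['target']
for r in (0.5, 0.7):
    y = correlated_copy(x, r, rng)
    sample_r = np.corrcoef(x, y)[0, 1]
    print(f"requested {r:.2f}, observed {sample_r:.3f}")
```

The variance of the result equals the variance of `x` (r² from the scaled series plus 1 - r² from the noise), which is why the covariance term alone fixes the correlation at `r`.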


In [None]:
# categorical variables with little correlation to target
df['random choice 2'] = random_choice_data(['Rand Choice 1', 'Rand Choice 2'], size=len(df))
df['random choice 4'] = random_choice_data(['North', 'South', 'East', 'West'], size=len(df))
df['random choice 7'] = random_choice_data(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], size=len(df))

# categorical random choices with random # of labels
num_labels = np.random.randint(3, 5)
df[f'random label num {num_labels}'] = random_choice_data([f'label num lo {i}' for i in range(1, num_labels + 1)], size=len(df))

num_labels = np.random.randint(10, 15)
df[f'random label num {num_labels}'] = random_choice_data([f'label num hi {i}' for i in range(1, num_labels + 1)], size=len(df))

In [None]:
# categorical variables correlated with target
df['pd qcut1'] = pd.qcut(df['target'], 2, labels=['Low', 'High']) # bi label
df['pd qcut2'] = pd.qcut(df['target'], 4, labels=['Q1', 'Q2', 'Q3', 'Q4']) # 4 labels

quantiles = [0, 0.1, 0.2, 0.4, 0.6, 0.8, 1]
df['pd qcut3'] = pd.qcut(df['target'], quantiles, labels=['G1', 'G2', 'G3', 'G4', 'G5', 'G6']) # 6 labels

In [None]:
# generate two pairs of normally distributed continuous features; within each pair the correlation exceeds .5 in absolute value
# gen_multivariate_normal_data(mean=[0, 0], cov=[[1, 0], [0, 1]], size=len(df))
df['multicollinearity 1'], df['multicollinearity 2'] = gen_multivariate_normal_data(mean=[0, 0], cov=[[1, .7], [.7, 1]], size=len(df))
df['multicollinearity 3'], df['multicollinearity 4'] = gen_multivariate_normal_data(mean=[0, 0], cov=[[1, .9], [.9, 1]], size=len(df))

In [None]:
# generate two normally distributed features that are correlated with the target
# gen_correlated_normal_series(original_series, target_correlation, size=len(df))
df['correlated w target 1'] = gen_correlated_normal_series(df['target'], target_correlation=.5)
df['correlated w target 2'] = gen_correlated_normal_series(df['target'], target_correlation=.7)
df.info()

In [None]:
# generate two uniformly distributed features that are correlated with the target
# gen_correlated_uniform_series(original_series, correlation_coefficient=0, size=len(df))
df['uniform corr 1'] = gen_correlated_uniform_series(df['target'])
df['uniform corr 2'] = gen_correlated_uniform_series(df['target'])

In [None]:
# create two features that are duplicates of other features
df['duplicate_1'] = df['informative_1']
df['duplicate_2'] = df['informative_2']

In [None]:
# create two numerical features with outliers
df['outliers 1'] = gen_outliers(mean=0, std_dev=1, size=len(df), outlier_percentage=0.1, outlier_magnitude=3)
df['outliers 2'] = gen_outliers(mean=3, std_dev=2, size=len(df), outlier_percentage=0.2, outlier_magnitude=2)

In [None]:
# create a numerical feature that needs standard scaling
df['standard scaling'] = gen_standard_scaling()

In [None]:
# create a numerical feature that needs min max scaling
df['min max scaling'] = gen_minmax_scaling()

In [None]:
# generate null values
for col in df.drop(['class', 'informative_1', 'informative_2', 'target', 'duplicate_1', 'duplicate_2'], axis=1).columns:
    df[col] = gen_null(df[col], np.random.choice([0, 5, 10, 20, 30, 50], size=1).item())

In [None]:
# create two features that have constant values
df['constant_1'] = 'constant_value'
df['constant_2'] = 'constant_value'

In [None]:
# create two features with semi constant values
df['semi_constant_1'] = gen_quasi_constants('q_const', variation_percentage = 1)
df['semi_constant_2'] = gen_quasi_constants('q_const', variation_percentage = 1)

In [None]:
print(df.info())  # check progress

In [None]:
# add duplicates
dupes = df.loc[0:9]
df = pd.concat([df, dupes], axis=0)

# shuffle all columns
# df = df.sample(frac=1).reset_index(drop=True)
# df = df.sample(frac=1, axis=1)

# shuffle selected columns
demographic_columns = demographics.columns
remaining_columns = [col for col in df.columns if col not in demographic_columns]
# print(remaining_columns)
np.random.shuffle(remaining_columns)

# Reassemble the DataFrame with the shuffled columns
df = df[list(demographic_columns) + list(remaining_columns)]

# move the class column (the dependent variable) to the end of the list
class_var = 'class'
df = df[df.drop('class', axis=1).columns.tolist() + [class_var]]

print(df.shape)
print(df.info())
df.head()

In [None]:
df.to_csv('data science fiction ii pt 1.csv', index=False)

# Part 2 - Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) summarizes and visualizes a dataset so that its structure, patterns, and problems (missing values, outliers, correlated features) are understood before modeling. It is usually the first step in a data analysis.

## Load Data

In [None]:
import pandas as pd

df = pd.read_csv('data science fiction ii pt 1.csv')
print(df.shape)
print(df.info())
df.head()

## Var Types

In [None]:
df_numerical = df.select_dtypes(include='number').columns
df_object = df.select_dtypes(include=['object']).columns
df_discrete = df.select_dtypes(include=['category']).columns
df_categorical_features = df.select_dtypes(include=['category', 'object']).columns
print(df_numerical)
print(df_object)
print(df_discrete)
print(df_categorical_features)

## Correlation

In [None]:
# code along
df[df_numerical].corr().round(2)

In [None]:
# show correlation between the features
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# correlation matrix
sns.set(style="white")

# compute the correlation matrix
corr = df[df_numerical].corr().round(1)

# generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# set up the matplotlib figure
# f, ax = plt.subplots()
f = plt.figure(figsize=(12, 12))

# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True);

plt.tight_layout()

In [None]:
# calculate the correlation matrix
corr_matrix = df[df_numerical].corr()

# Create a mask for the upper triangle and diagonal (to avoid duplicate pairs)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))

# Convert the lower triangle of the correlation matrix to a long format
corr_df = corr_matrix.mask(mask).stack().reset_index()
corr_df.columns = ['feature1', 'feature2', 'correlation']

# Filter for correlations above a certain threshold (e.g., 0.7)
high_corr_df = corr_df[(abs(corr_df['correlation']) > 0.7) & (corr_df['feature1'] != corr_df['feature2'])]

# Sort by absolute correlation in descending order
high_corr_df = high_corr_df.sort_values(by='correlation', ascending=False, key=abs)

# Print the top correlated features
# print(high_corr_df['feature1'].to_list()[4:10])
print(high_corr_df)

# Create a variable to pickle
data = {'correlation scores': high_corr_df}

In [None]:
# check for vif
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

# handle null values (using mean imputation for simplicity)
x_copy = df.drop('class', axis=1).select_dtypes(include='number')
x_copy = x_copy.fillna(x_copy.mean())

print(max([variance_inflation_factor(x_copy, i) for i in range(x_copy.shape[1])]))

# calculate VIF
vif = pd.DataFrame()
vif["Variable"] = x_copy.columns
vif["VIF"] = [variance_inflation_factor(x_copy, i) for i in range(x_copy.shape[1])]
print(vif)

## Multicollinearity

* We want high correlation with target
* We don't want high correlation between features
* Drop correlated features
* Combine correlated features
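
To make the link between correlated features and VIF concrete: the VIF of feature $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the remaining features. The sketch below computes it with plain NumPy on synthetic columns (the names `vif_from_r2`, `a`, `b`, and `c` are illustrative). It adds an intercept to each regression, which `variance_inflation_factor` does not do on its own, so the numbers can differ from the statsmodels values on uncentered data.

```python
import numpy as np

def vif_from_r2(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(y)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    r2 = 1 - residuals.var() / y.var()
    return 1 / (1 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=2_000)
b = 0.9 * a + np.sqrt(1 - 0.9**2) * rng.normal(size=2_000)  # correlation about 0.9 with a
c = rng.normal(size=2_000)                                   # independent of a and b
X = np.column_stack([a, b, c])

vifs = [vif_from_r2(X, j) for j in range(X.shape[1])]
print(np.round(vifs, 2))
```

Dropping `b`, or combining `a` and `b` (for example by averaging them), would bring the remaining VIFs back toward 1.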

In [None]:
# iterate dropping features with high vif
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

removed = []
x_copy1 = x_copy.copy()
thresh = 10
max_vif = np.inf  # forces the first pass through the loop
while max_vif > thresh:
  my_list = [variance_inflation_factor(x_copy1, i) for i in range(x_copy1.shape[1])]
  max_vif = max(my_list)
  if max_vif > thresh:
    max_index = my_list.index(max_vif)
    removed.append(x_copy1.columns[max_index])
    print(x_copy1.columns[max_index], max_vif)
    x_copy1.drop(x_copy1.columns[max_index], axis=1, inplace=True)


# Calculate VIF
vif = pd.DataFrame()
vif["Variable"] = x_copy1.columns
vif["VIF"] = [variance_inflation_factor(x_copy1, i) for i in range(x_copy1.shape[1])]
print(vif)

# Create a variable to pickle
data = {'vif': vif}


In [None]:
print(removed)

## Outliers

In [None]:
# code along
df.boxplot(column=['outliers 1']);

In [None]:
# code along
df.describe()

In [None]:
import pandas as pd

def count_outliers_iqr(df, column):
    """Counts the number of outliers in a DataFrame column using the IQR method."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return len(outliers)

def detect_and_print_numerical_outliers_iqr(df):
    """
    Iterates through numerical columns in a DataFrame and prints the
    variable name with the number of outliers based on the IQR method.
    """
    numerical_cols = df.select_dtypes(include=['number']).columns
    for col in numerical_cols:
        num_outliers = count_outliers_iqr(df, col)
        print(f"Variable: {col}, Number of outliers (IQR): {num_outliers}")


detect_and_print_numerical_outliers_iqr(df[df_numerical])
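
As a worked illustration of the IQR rule used above, the sketch below applies the same fences to a small made-up series and shows one common treatment, capping values at the fences. Whether to cap, drop, or keep outliers is a judgment call for Part 3; this is only one option, and the helper name `iqr_bounds` is hypothetical.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up data: 100 is an obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
lower, upper = iqr_bounds(values)

is_outlier = (values < lower) | (values > upper)
capped = values.clip(lower=lower, upper=upper)  # cap values at the fences

print(lower, upper)           # -3.5 14.5
print(int(is_outlier.sum()))  # 1
print(capped.max())           # 14.5
```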

In [None]:
df.to_csv('data science fiction ii pt 2.csv', index=False)

# Part 3 - Data Prep

https://www.udemy.com/course/feature-engineering-for-machine-learning

* Types and characteristics of data
* Missing data imputation
* Categorical encoding
* Variable transformation
* Discretization
* Outliers
* Datetime
* Scaling
* Feature creation

## Load Data

In [None]:
import pandas as pd

df = pd.read_csv('data science fiction ii pt 2.csv')
print(df.shape)
print(df.info())
df.head()

## Clean the Data

In [None]:
# constants

In [None]:
# quasi constants

In [None]:
# duplicate rows

In [None]:
# duplicate features

In [None]:
# missing data

In [None]:
# scaling

In [None]:
# outliers

## Identify Variable Types for Encoding

In [None]:
# df_numerical = df.select_dtypes(include='number').columns
# df_object = df.select_dtypes(include=['object']).columns
# df_discrete = df.select_dtypes(include=['category']).columns
# df_categorical_features = df.select_dtypes(include=['category', 'object']).columns
# print(df_numerical)
# print(df_object)
# print(df_discrete)
# print(df_categorical_features)

# Part 4 - Feature Engineering

## Derived Variables

In [None]:
# derived variables coding

## Categorical Encoding

In [None]:
# categorical encoding

In [None]:
# check that everything is numerical

In [None]:
# df.to_csv('data science fiction ii pt 3.csv', index=False)

# Part 5 - Feature Selection

In [None]:
# # get data
# import pandas as pd

# df = pd.read_csv('data science fiction ii pt 3.csv')
# print(df.shape)
# print(df.info())
# df.head()

## Train Test Split

`random_state` was initialized in the first code cell (Seed the Project).

In [None]:
# from sklearn.model_selection import train_test_split

# X_train, X_test, y_train, y_test = train_test_split(df.drop('class', axis=1), df['class'], test_size=0.3, random_state=random_state)
# X_train.shape, X_test.shape

## Mutual Information

In [None]:
# # mutual information
# import matplotlib.pyplot as plt
# from sklearn.feature_selection import mutual_info_classif

# mi = mutual_info_classif(X_train, y_train)
# mi = pd.Series(mi)
# mi.index = X_train.columns
# mi.sort_values(ascending=False).plot.bar()
# plt.ylabel('Mutual Information');

In [None]:
# mi_keepers = mi.sort_values(ascending=False).index[:5]
# print(mi_keepers)

## SelectKBest

In [None]:
# # SelectKBest
# from sklearn.feature_selection import SelectKBest, f_regression, f_classif

# selector = SelectKBest(f_classif, k=5) # Select the top 5 features
# selector.fit(X_train, y_train)  # fit() returns the selector itself, not transformed data

# kb_keepers = X_train.columns.values[selector.get_support()]
# print(kb_keepers)

## Select From Model

In [None]:
# # Select from model
# import numpy as np
# from sklearn.linear_model import LinearRegression, LogisticRegression
# from sklearn.feature_selection import SelectFromModel
# from sklearn.preprocessing import StandardScaler

# scaler = StandardScaler()
# scaler.fit(X_train)
# X_scaled = scaler.transform(X_train)

# selections = SelectFromModel(estimator=LogisticRegression()).fit(X_scaled, y_train)
# mt_keepers = X_train.columns.values[selections.get_support()]
# print(mt_keepers)

## Recursive Feature Elimination

In [None]:
# from sklearn.feature_selection import RFE
# from sklearn.linear_model import LinearRegression, LogisticRegression

# estimator = LogisticRegression()
# selector = RFE(estimator, n_features_to_select=5) # Select the top 5 features
# X_new = selector.fit_transform(X_scaled, y_train)
# rf_keepers = X_train.columns.values[selector.get_support()]  # use the RFE selector, not the earlier SelectFromModel
# print(rf_keepers)

## Random Forest Importance


In [None]:
# # random forest importance
# from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# from sklearn.feature_selection import SelectFromModel

# selects = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=random_state), max_features=4)
# selects.fit(X_train, y_train)
# rfi = X_train.columns[(selects.get_support())]
# rfi.tolist()

## Review Previous Variables

* Correlated features
* VIF
* Outliers

In [None]:
# # make a list of features you have selected and use it as a filter
# features_to_model = []

In [None]:
# X_train = X_train[features_to_model]
# X_test = X_test[features_to_model]

# Part 6 - Data Modeling and Evaluation

## Logistic Regression

In [None]:
# # model, predict, evaluate, and plot
# from sklearn.linear_model import LogisticRegression
# from sklearn.metrics import confusion_matrix, accuracy_score

# model = LogisticRegression(solver='liblinear', random_state=random_state)
# model.fit(X_train, y_train)
# predictions = model.predict(X_test)

# train_accuracy = model.score(X_train, y_train)
# print(f"Training Accuracy: {train_accuracy:.4f}")
# test_accuracy = model.score(X_test, y_test)
# print(f"Testing Accuracy: {test_accuracy:.4f}")

## Model Evaluation

In [None]:
# from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
# print('accuracy:', accuracy_score(y_test, predictions))
# # compare with other metrics

**The order of `y_test` (true labels) and `predictions` (predicted labels) matters significantly for `confusion_matrix` and `classification_report` in scikit-learn.**

However, for **`accuracy_score`**, the order does **not** matter because it simply calculates the proportion of correctly classified instances, regardless of which is considered the "true" and which is the "predicted" set in the function call.

Let's break down why the order is crucial for `confusion_matrix` and `classification_report`:

**1. `confusion_matrix(y_test, predictions)`:**

* The first argument (`y_test`) should always be the **true labels** (the actual values).
* The second argument (`predictions`) should always be the **predicted labels** (the values your model has outputted).

The output of `confusion_matrix` is a 2x2 (for binary classification) or NxN (for multi-class classification) array where:

* The rows correspond to the **true classes**.
* The columns correspond to the **predicted classes**.

Therefore, `confusion_matrix(y_test, predictions)` will produce a matrix where:

* `TN` (True Negative) is the count of instances where the true label was negative and the prediction was negative.
* `FP` (False Positive) is the count of instances where the true label was negative and the prediction was positive.
* `FN` (False Negative) is the count of instances where the true label was positive and the prediction was negative.
* `TP` (True Positive) is the count of instances where the true label was positive and the prediction was positive.

If you reverse the order and call `confusion_matrix(predictions, y_test)`, the matrix is transposed: rows now hold the predicted classes and columns the true classes, so `FP` and `FN` trade places while `TN` and `TP` are unchanged. Unpacking the result with `tn, fp, fn, tp = ....ravel()` would then mislabel the two error types.

**2. `classification_report(y_test, predictions)`:**

* Similar to `confusion_matrix`, the first argument (`y_test`) must be the **true labels**, and the second argument (`predictions`) must be the **predicted labels**.

The `classification_report` provides a text summary of the precision, recall, F1-score, and support for each class. Precision and recall are computed from each class's true positives, false positives, and false negatives, so swapping the arguments exchanges precision and recall for every class (the per-class F1-score is unchanged) and reports support counts taken from the predictions instead of the true labels.

**3. `accuracy_score(y_test, predictions)`:**

* For `accuracy_score`, the order does **not** matter. Accuracy is calculated as the number of correct predictions divided by the total number of predictions:

    `Accuracy = (Number of Correct Predictions) / (Total Number of Predictions)`

    Whether you compare `y_test` against `predictions` or `predictions` against `y_test`, the set of correctly matched instances will be the same, and the total number of instances remains the same. Therefore, the accuracy score will be identical regardless of the order of the arguments.

**In summary:**

* **`confusion_matrix`:** **Order matters.** Always use `confusion_matrix(y_test, predictions)`.
* **`classification_report`:** **Order matters.** Always use `classification_report(y_test, predictions)`.
* **`accuracy_score`:** **Order does not matter.** `accuracy_score(y_test, predictions)` will yield the same result as `accuracy_score(predictions, y_test)`.

It's crucial to maintain the correct order of true labels and predicted labels when using `confusion_matrix` and `classification_report` to ensure accurate evaluation of your classification model.
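
A small check of the argument-order claims above, using made-up labels (`y_true_demo` and `y_pred_demo` are illustrative, not the notebook's `y_test` and `predictions`):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_demo = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 0])

correct = confusion_matrix(y_true_demo, y_pred_demo)  # rows = true, columns = predicted
swapped = confusion_matrix(y_pred_demo, y_true_demo)  # arguments reversed

tn, fp, fn, tp = correct.ravel()
print(correct)
print(swapped)

# Reversing the arguments transposes the matrix, so FP and FN trade places
print(np.array_equal(swapped, correct.T))  # True

# Accuracy only counts matches, so it is symmetric in its arguments
acc_forward = accuracy_score(y_true_demo, y_pred_demo)
acc_reversed = accuracy_score(y_pred_demo, y_true_demo)
print(acc_forward == acc_reversed)  # True
```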

## Metrics

* tn = pred 0 actual 0
* fp = pred 1 actual 0
* fn = pred 0 actual 1
* tp = pred 1 actual 1
* acc(uracy) = $\frac{tn + tp}{total}$
* error = $\frac{fp + fn}{total}$
* prev(alence) = $\frac{fn + tp}{total}$
* queue = $\frac{fp + tp}{total}$
* tpr = $\frac{tp}{tp + fn}$
    * true positive rate
    * recall
    * sensitivity
    * prob of detection
    * 1 - fnr
* fnr = $\frac{fn}{tp + fn}$
    * false negative rate
    * type II error
    * 1 - tpr
* tnr = $\frac{tn}{tn + fp}$
    * true negative rate
    * specificity
    * 1 - fpr
* fpr = $\frac{fp}{tn + fp}$
    * false positive rate
    * type I error
    * fall out
    * prob of false claim
    * 1 - tnr
* ppv = $\frac{tp}{tp + fp}$
    * positive predictive value
    * precision
    * 1 - fdr
* fdr = $\frac{fp}{tp + fp}$
    * false discovery rate
    * 1 - ppv
* npv = $\frac{tn}{tn + fn}$
    * negative predictive value
    * 1 - for
* for = $\frac{fn}{tn + fn}$
    * false omission rate
    * 1 - npv
* likelihood ratio+ (lr+) = $\frac{tpr}{fpr}$
    * roc
* likelihood ratio- (lr-) = $\frac{fnr}{tnr}$
* diagnostic odds ratio = $\frac{lr+}{lr-}$
* f1 score = 2 * $\frac{precision * recall}{precision+recall}$
* Youden's J = sensitivity + specificity - 1 = tpr - fpr
* Matthews Correlation Coefficient = $\frac{(tp*tn)-(fp*fn)}{\sqrt{(tp+fp)(tp+fn)(tn+fp)(tn+fn)}}$
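
The metrics in this list can be evaluated directly from the four counts. The sketch below uses made-up counts (not results from this notebook) and plain Python. It follows the standard definitions, in particular F1 as `2 * precision * recall / (precision + recall)` and an MCC numerator of `tp * tn - fp * fn`.

```python
import math

# Made-up confusion-matrix counts for illustration
tn, fp, fn, tp = 50, 10, 5, 35
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
tpr = tp / (tp + fn)  # recall, sensitivity
tnr = tn / (tn + fp)  # specificity
fpr = fp / (tn + fp)
fnr = fn / (tp + fn)
ppv = tp / (tp + fp)  # precision
npv = tn / (tn + fn)

f1 = 2 * ppv * tpr / (ppv + tpr)
youden_j = tpr + tnr - 1
lr_pos = tpr / fpr
lr_neg = fnr / tnr
diagnostic_odds_ratio = lr_pos / lr_neg
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

for name, value in [("accuracy", accuracy), ("precision", ppv), ("recall", tpr),
                    ("f1", f1), ("Youden's J", youden_j),
                    ("DOR", diagnostic_odds_ratio), ("MCC", mcc)]:
    print(f"{name}: {value:.3f}")
```

Two identities make handy cross-checks: F1 equals `2*tp / (2*tp + fp + fn)`, and the diagnostic odds ratio equals `(tp*tn) / (fp*fn)`.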
  

## Confusion Matrix

In [None]:
# print(confusion_matrix(y_test, predictions))

### Explanation

Please explain what the Confusion Matrix is telling you

## Precision Recall

In [None]:
# print(classification_report(y_test, predictions))

### Explanation

Please explain what precision and recall are telling you in the classification report

## Bias Variance

In [None]:
# from mlxtend.evaluate import bias_variance_decomp

# avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
#     model,
#     X_train.values, # Convert X_train to NumPy array
#     y_train.values, # Convert y_train to NumPy array
#     X_test.values, # Convert X_test to NumPy array
#     y_test.values, # Convert y_test to NumPy array
#     loss='0-1_loss',
#     random_seed=random_state)

# print('Average expected loss: %.3f' % avg_expected_loss)
# print('Average bias: %.3f' % avg_bias)
# print('Average variance: %.3f' % avg_var)


### Explanation

Please explain how to interpret bias and variance and what they mean for your model

# Coming Soon - Making Predictions and Gradio