Machine Learning

# Introduction to Machine Learning
## Definition and Overview
### What is Machine Learning?
### History and Evolution
### Key Concepts and Terminology
## Types of Machine Learning
### Supervised Learning
#### Definition and Examples
#### Use Cases
#### Algorithms (e.g., `LinearRegression` from `sklearn.linear_model`)
### Unsupervised Learning
#### Definition and Examples
#### Use Cases
#### Algorithms (e.g., `KMeans` from `sklearn.cluster`)
### Semi-supervised Learning
#### Definition and Examples
#### Use Cases
#### Algorithms (e.g., `SelfTrainingClassifier` from `sklearn.semi_supervised`)
### Reinforcement Learning
#### Definition and Examples
#### Use Cases
#### Algorithms (e.g., Q-Learning, Deep Q-Networks with `tensorflow` or `pytorch`)
## Applications of Machine Learning
### Healthcare
### Finance
### Retail
### Autonomous Vehicles
### Natural Language Processing

# Data Preprocessing
## Data Cleaning
### Handling Missing Values
#### Mean/Median Imputation (e.g., `SimpleImputer` from `sklearn.impute`)
#### Dropping Missing Values (e.g., `dropna()` from `pandas`)
#### Filling with Forward/Backward Fill (e.g., `ffill()` / `bfill()` from `pandas`)
### Handling Outliers
#### Z-Score Method (e.g., `scipy.stats.zscore`)
#### IQR Method (e.g., using `quantile` from `pandas`)
#### Winsorization (e.g., `winsorize` from `scipy.stats.mstats`)
## Data Transformation
### Encoding Categorical Variables
#### One-Hot Encoding (e.g., `OneHotEncoder` from `sklearn.preprocessing`)
#### Label Encoding (e.g., `LabelEncoder` from `sklearn.preprocessing`)
#### Binary Encoding (e.g., `BinaryEncoder` from `category_encoders`)
### Feature Scaling
#### Normalization (Min-Max Scaling) (e.g., `MinMaxScaler` from `sklearn.preprocessing`)
#### Standardization (Z-Score Scaling) (e.g., `StandardScaler` from `sklearn.preprocessing`)
### Feature Engineering
#### Creating New Features (e.g., using `pandas`)
#### Polynomial Features (e.g., `PolynomialFeatures` from `sklearn.preprocessing`)
#### Interaction Features (e.g., using `pandas` and custom functions)
#### Log Transformations (e.g., `numpy.log`)

# Exploratory Data Analysis (EDA)
## Descriptive Statistics
### Measures of Central Tendency (Mean, Median, Mode) (e.g., `mean`, `median` from `numpy` or `pandas`; `mode` from `pandas` or `scipy.stats`)
### Measures of Dispersion (Variance, Standard Deviation, Range) (e.g., `var`, `std` from `numpy` or `pandas`)
### Skewness and Kurtosis (e.g., `skew`, `kurtosis` from `scipy.stats`)
## Data Visualization Techniques
### Histograms (e.g., `hist` from `matplotlib.pyplot` or `histplot` from `seaborn`)
### Box Plots (e.g., `boxplot` from `matplotlib.pyplot` or `seaborn`)
### Scatter Plots (e.g., `scatter` from `matplotlib.pyplot` or `scatterplot` from `seaborn`)
### Pair Plots (e.g., `pairplot` from `seaborn`)
### Heatmaps (e.g., `heatmap` from `seaborn`)
### Violin Plots (e.g., `violinplot` from `seaborn`)
## Identifying Patterns and Relationships
### Correlation Analysis
#### Pearson Correlation (e.g., `corr` from `pandas`)
#### Spearman Correlation (e.g., `spearmanr` from `scipy.stats`)
### Trend Analysis (e.g., using `pandas` time series methods)
### Detecting Seasonality (e.g., using `statsmodels.tsa`)

# Supervised Learning
## Regression Algorithms
### Linear Regression
#### Assumptions
#### Model Building (e.g., `LinearRegression` from `sklearn.linear_model`)
#### Evaluation Metrics (MSE, RMSE, R²) (e.g., `mean_squared_error`, `r2_score` from `sklearn.metrics`)
### Polynomial Regression
#### Polynomial Features (e.g., `PolynomialFeatures` from `sklearn.preprocessing`)
#### Overfitting and Underfitting
### Ridge and Lasso Regression
#### Regularization Techniques (e.g., `Ridge`, `Lasso` from `sklearn.linear_model`)
#### Hyperparameter Tuning (α) (e.g., `GridSearchCV` from `sklearn.model_selection`)
### Support Vector Regression
#### Kernel Trick (e.g., `SVR` from `sklearn.svm`)
#### Epsilon-Insensitive Loss
## Classification Algorithms
### Logistic Regression
#### Sigmoid Function
#### Thresholding
#### ROC Curve and AUC (e.g., `roc_curve`, `auc` from `sklearn.metrics`)
### k-Nearest Neighbors (k-NN)
#### Distance Metrics (Euclidean, Manhattan) (e.g., `KNeighborsClassifier` from `sklearn.neighbors`)
#### Choosing k
### Support Vector Machines (SVM)
#### Linear SVM
#### Kernel SVM (RBF, Polynomial) (e.g., `SVC` from `sklearn.svm`)
#### Hyperparameters (C, Gamma)
### Decision Trees
#### Splitting Criteria (Gini, Entropy) (e.g., `DecisionTreeClassifier` from `sklearn.tree`)
#### Pruning Techniques
### Random Forests
#### Bagging (e.g., `RandomForestClassifier` from `sklearn.ensemble`)
#### Feature Importance
### Gradient Boosting
#### Boosting Principle
#### Variants (AdaBoost, Gradient Boosting, XGBoost, LightGBM) (e.g., `GradientBoostingClassifier`, `XGBClassifier`, `LGBMClassifier`)
### Neural Networks
#### Perceptrons
#### Multilayer Perceptrons (MLP) (e.g., `MLPClassifier` from `sklearn.neural_network`)
#### Activation Functions (ReLU, Sigmoid, Tanh)

# Unsupervised Learning
## Clustering Algorithms
### k-Means Clustering
#### Choosing k (Elbow Method, Silhouette Score)
#### Initial Centroid Selection (e.g., `KMeans` from `sklearn.cluster`)
### Hierarchical Clustering
#### Agglomerative vs. Divisive
#### Dendrograms (e.g., `AgglomerativeClustering` from `sklearn.cluster`, `dendrogram` from `scipy.cluster.hierarchy`)
### DBSCAN
#### Density-Based Clustering
#### Parameters (Epsilon, MinPts) (e.g., `DBSCAN` from `sklearn.cluster`)
## Dimensionality Reduction
### Principal Component Analysis (PCA)
#### Eigenvalues and Eigenvectors
#### Explained Variance Ratio (e.g., `PCA` from `sklearn.decomposition`)
### t-Distributed Stochastic Neighbor Embedding (t-SNE)
#### Perplexity
#### Use Cases (e.g., `TSNE` from `sklearn.manifold`)
### Linear Discriminant Analysis (LDA)
#### Maximizing Class Separability
#### Comparison with PCA (e.g., `LinearDiscriminantAnalysis` from `sklearn.discriminant_analysis`)

# Semi-Supervised Learning
## Self-Training
### Pseudo-Labeling
#### Confidence Thresholds
#### Iterative Refinement
## Co-Training
### View Selection
#### Independent Feature Sets
## Graph-Based Semi-Supervised Learning
### Label Propagation (e.g., `LabelPropagation` from `sklearn.semi_supervised`)
### Graph Convolutional Networks

# Reinforcement Learning
## Introduction to Reinforcement Learning
### Basics of RL
### Key Concepts: Agent, Environment, State, Action, Reward
## Markov Decision Processes (MDP)
### States and Actions
### Transition Probabilities
### Rewards
## Q-Learning
### Q-Table
### Bellman Equation
### Exploration vs. Exploitation
## Deep Q-Networks (DQN)
### Neural Network Architecture (e.g., using `tensorflow` or `pytorch`)
### Experience Replay
### Target Networks
## Policy Gradient Methods
### REINFORCE Algorithm
### Actor-Critic Methods

# Model Evaluation and Selection

## Train/Test Split

### Holdout Validation

### Stratified Sampling
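
A minimal sketch of a stratified holdout split, assuming a small synthetic dataset with a 10% positive class. The names `X_demo` and `y_demo` are hypothetical and not part of these notes. Passing `stratify=` to `train_test_split` keeps the class ratio the same in the train and test sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical toy data: 100 samples, 10 positives and 90 negatives
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 10 + [0] * 90)

# stratify=y_demo preserves the 10% positive rate in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(y_train.mean(), y_test.mean())  # both 0.1
```

Without `stratify`, a small test set can end up with very few (or zero) positives, which makes metrics like recall unreliable.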

## Cross-Validation
https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

### k-Fold Cross-Validation

#### using cross_val_score

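In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# load data
df = pd.read_csv('data/movie_reviews.csv')

# instantiate the model to evaluate
model = LinearRegression()

# define X and y from dataset
# (template: LinearRegression needs numeric features and a numeric 'target' column)
y = df['target']
X = df.drop(columns='target')

# run 5-fold cross-validation: returns one score per fold (R² by default for regressors)
scores = cross_val_score(model, X, y, cv=5)

# average the fold scores to get a single estimate
scores.mean()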

#### using cross_validate
- It allows specifying multiple metrics for evaluation.
- It returns a dict containing fit-times, score-times (and optionally training scores, fitted estimators, train-test split indices) in addition to the test score.

In [None]:
# using cross_validate
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# load data
df = pd.read_csv('data/movie_reviews.csv')

# instantiate the model to evaluate
model = LinearRegression()

# define X and y from dataset
y = df['target']
X = df.drop(columns='target')

# set the scoring metrics you want to look at. can be a tuple as well
# can also pass a dict where the keys are metric names and the values are scorer callables
scoring = ['r2', 'neg_mean_squared_error']

# use cross_validate to get a dict of results
scores = cross_validate(model, X, y, cv=5, scoring=scoring)

# the result is a dict, so average each metric by its 'test_<metric>' key
scores['test_r2'].mean()
scores['test_neg_mean_squared_error'].mean()

#### Advantages and Disadvantages

### Leave-One-Out Cross-Validation

#### Use Cases
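
A small sketch of leave-one-out cross-validation, assuming the first 50 rows of scikit-learn's bundled diabetes dataset as a stand-in for a small dataset (the variable names are hypothetical). Each fold holds out exactly one sample, so R² cannot be computed per fold and an error metric is used instead.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# hypothetical small dataset: LOOCV fits one model per sample, so keep it small
X_small, y_small = load_diabetes(return_X_y=True)
X_small, y_small = X_small[:50], y_small[:50]

# one fold per sample; each test set contains a single row
loo = LeaveOneOut()
loo_scores = cross_val_score(
    LinearRegression(), X_small, y_small, cv=loo, scoring="neg_mean_squared_error"
)

print(len(loo_scores))      # number of fits equals number of samples
print(-loo_scores.mean())   # average squared error over all held-out samples
```

LOOCV is mostly practical when data is scarce, because the number of fits grows with the number of rows.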

## Evaluation Metrics for Regression

### Mean Squared Error (MSE)

### Root Mean Squared Error (RMSE)

### Mean Absolute Error (MAE)

### R² Score
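
A worked sketch of the four regression metrics above on a tiny hand-made example (the arrays `y_true` and `y_pred` are made up for illustration). RMSE is computed as the square root of MSE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mse = mean_squared_error(y_true, y_pred)    # mean of squared errors
rmse = np.sqrt(mse)                         # same units as the target
mae = mean_absolute_error(y_true, y_pred)   # mean of absolute errors
r2 = r2_score(y_true, y_pred)               # 1 - SS_res / SS_tot

print(mse, rmse, mae, r2)  # 0.5625 0.75 0.625 0.847...
```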

## Evaluation Metrics for Classification

### Accuracy
- Number of correct predictions divided by the total number of predictions: (TP + TN) / (TP + TN + FP + FN).
- Ratio of correct predictions.

### Precision
- Measures the ability of a model to avoid false alarms for a class, or how much a prediction of that class can be trusted: TP / (TP + FP).
- Ability to flag correctly.

### Recall
- Measures the ability of the model to detect occurrences of a class: TP / (TP + FN).
- Ability to flag.

### F1 Score
- A combination of precision and recall into a single metric.
- It is their harmonic mean: 2 × precision × recall / (precision + recall).
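
A small worked example of the four metrics above, using made-up labels (`y_true`, `y_pred` are hypothetical). With 3 true positives, 1 false negative, 1 false positive and 5 true negatives, the numbers can be checked by hand.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# hypothetical labels: TP=3, FN=1, FP=1, TN=5
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (3 + 5) / 10
precision = precision_score(y_true, y_pred)  # 3 / (3 + 1)
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)  # 0.8 0.75 0.75 0.75
```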

### Precision Recall Curve

In [2]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.preprocessing import RobustScaler

# `data` must already be loaded as a DataFrame with a 'Class' target column;
# running this cell first raises "NameError: name 'data' is not defined"

# instantiate the model
model = LogisticRegression(max_iter=2000, class_weight='balanced')

# define X and y
X = data.drop(columns='Class')
y = data['Class']

# scale X
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

# get a baseline score for the chosen metric (recall in this case)
scoring = 'recall'
scores = cross_val_score(model, X_scaled, y, cv=5, scoring=scoring)
scores.mean()

##### get the threshold for 90% recall

In [None]:
from sklearn.metrics import precision_recall_curve
import numpy as np

# get out-of-fold predictions
# method="predict_proba" returns one probability column per class (class 0, class 1);
# without it, cross_val_predict returns hard class labels
preds = cross_val_predict(model, X_scaled, y, cv=5, method="predict_proba")
# keep the probability of class 1
y_pred_1 = preds[:, 1]

# use precision_recall_curve to get precision, recall, threshold
# (precision and recall have one more element than threshold)
precision, recall, threshold = precision_recall_curve(y, y_pred_1)

# get all the indexes where recall is at least 90%
above90 = np.where(recall >= 0.90)

# Recall = ability to find all true positives
# Recall DECREASES as the threshold increases -> take the max() index,
# i.e. the highest threshold that still reaches 90% recall
# Precision = ability to avoid false alarms
# Precision generally INCREASES with the threshold, so the highest threshold
# that keeps 90% recall is usually the best trade-off
threshold_index = above90[0].max()
threshold[threshold_index]

### ROC Curve and AUC
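
A minimal sketch of computing a ROC curve and its AUC, assuming four made-up samples with predicted scores (`y_true` and `y_score` are hypothetical). AUC can be read as the fraction of positive/negative pairs that the scores rank correctly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# hypothetical labels and predicted probabilities for class 1
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# false positive rate and true positive rate at each candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# 3 of the 4 positive/negative pairs are ranked correctly
auc_value = roc_auc_score(y_true, y_score)
print(auc_value)  # 0.75
```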

### Confusion Matrix
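
A short sketch of building a confusion matrix from made-up labels (`y_true`, `y_pred` are hypothetical). For binary labels, scikit-learn orders the matrix as `[[TN, FP], [FN, TP]]`, so `ravel()` unpacks it in that order.

```python
from sklearn.metrics import confusion_matrix

# hypothetical labels: TP=3, FN=1, FP=1, TN=5
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)  # 5 1 1 3
```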

## Overfitting and Underfitting

### Bias-Variance Tradeoff

### Techniques to Mitigate Overfitting (Regularization, Pruning, Early Stopping)

## Model Selection and Hyperparameter Tuning

### Grid Search

In [4]:
data = pd.read_csv('data/movie_reviews.csv')
data

Unnamed: 0.1,Unnamed: 0,target,reviews,clean_reviews,target_encoded
0,0,neg,"plot : two teen couples go to a church party ,...",plot two teen couple go to a church party drin...,0
1,1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastard quick movie review damn that...,0
2,2,neg,it is movies like these that make a jaded movi...,it is movie like these that make a jaded movie...,0
3,3,neg,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first feature...,0
4,4,neg,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing ps...,0
...,...,...,...,...,...
1995,1995,pos,wow ! what a movie . \nit's everything a movie...,wow what a movie it everything a movie can be ...,1
1996,1996,pos,"richard gere can be a commanding actor , but h...",richard gere can be a commanding actor but he ...,1
1997,1997,pos,"glory--starring matthew broderick , denzel was...",glorystarring matthew broderick denzel washing...,1
1998,1998,pos,steven spielberg's second epic film on world w...,steven spielberg second epic film on world war...,1


In [5]:
#imports
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import set_config; set_config(display="diagram")

# import data
data = pd.read_csv('data/movie_reviews.csv')

# Create Pipeline
vectorizer = TfidfVectorizer()
model = MultinomialNB()

pipeline_tfidf = Pipeline([
    ('tfidf', vectorizer), 
    ('nb', model)
])

# Set parameters you want to search through
parameters = {
    'tfidf__ngram_range': ((1,1), (2,2), (3,3)),
    'tfidf__min_df': (0.01,0.05),
    'tfidf__max_df': (0.8,0.9),
    'nb__alpha': (0.01,0.1,1,10)
}

# Perform grid search on pipeline
grid_search = GridSearchCV(
    pipeline_tfidf,
    parameters,
    scoring = "recall",
    cv = 5,
    n_jobs=-1,
    verbose=1
)

# fit it on the data
grid_search.fit(data.clean_reviews, data.target_encoded)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


In [6]:
# get the best score
grid_search.best_score_

0.853

In [7]:
# get best params
grid_search.best_params_

{'nb__alpha': 10,
 'tfidf__max_df': 0.9,
 'tfidf__min_df': 0.01,
 'tfidf__ngram_range': (2, 2)}

In [8]:
# get the best estimator
grid_search.best_estimator_

### Random Search
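
A minimal sketch of random search with `RandomizedSearchCV`, assuming a synthetic classification dataset and a logistic regression model as stand-ins (`X_demo`, `y_demo` and the search range for `C` are hypothetical choices). Instead of trying every combination as grid search does, it samples `n_iter` candidates from the given distributions.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# hypothetical toy data
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# sample C on a log scale between 0.001 and 100
param_distributions = {"C": loguniform(1e-3, 1e2)}

random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=10,          # number of sampled candidates
    scoring="recall",
    cv=5,
    random_state=0,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(random_search.best_score_)
```

The cost is fixed by `n_iter` rather than by the size of the grid, which helps when there are many hyperparameters.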
### Bayesian Optimization

# Ensemble Learning
## Bagging
### Bootstrap Aggregating
### Random Forests
#### Out-of-Bag Error
## Boosting
### AdaBoost
#### Weight Updates
### Gradient Boosting
#### Gradient Descent
#### Learning Rate
### XGBoost
#### Regularization Parameters
### LightGBM
#### Leaf-wise Growth
## Stacking
### Base Models
### Meta-Model
## Voting Classifiers
### Hard Voting

# Natural Language Processing (NLP)

## Text Preprocessing
  ### Tokenization (e.g., `word_tokenize` from `nltk`, `Tokenizer` from `keras.preprocessing.text`)

- Word Tokenization
- Sentence Tokenization

  ### Stop Words Removal (e.g., `stopwords` from `nltk.corpus`)

  ### Stemming (e.g., `PorterStemmer` from `nltk.stem`)

  ### Lemmatization (e.g., `WordNetLemmatizer` from `nltk.stem`)

  ### N-grams

## Feature Extraction

  ### Bag-of-Words (BoW)

#### Count Vectorization (e.g., `CountVectorizer` from `sklearn.feature_extraction.text`)

##### Vector Representation

In [5]:
texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat'
]

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(texts)
X

<4x11 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [6]:
X.toarray()

array([[1, 1, 0, 0, 0, 1, 1, 2, 1, 1, 0],
       [0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1],
       [3, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0]])

In [7]:
count_vectorizer.get_feature_names_out()

array(['cat', 'dog', 'for', 'good', 'health', 'is', 'running', 'the',
       'with', 'young', 'your'], dtype=object)

In [8]:
# convert the sparse matrix to a DataFrame (one row per text, one column per word)

import pandas as pd

vectorized_texts = pd.DataFrame(
    X.toarray(),
    columns = count_vectorizer.get_feature_names_out(),
    index = texts
)

vectorized_texts

Unnamed: 0,cat,dog,for,good,health,is,running,the,with,young,your
the young dog is running with the cat,1,1,0,0,0,1,1,2,1,1,0
running is good for your health,0,0,1,1,1,1,1,0,0,0,1
your cat is young,1,0,0,0,0,1,0,0,0,1,1
young young young young young cat cat cat,3,0,0,0,0,0,0,0,0,5,0


##### Sparsity Issues
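
A short sketch of measuring sparsity on the same four example texts used above. `nnz` is the number of stored (non-zero) entries in the sparse matrix, and the variable names here are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

sparsity_texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat'
]

X_counts = CountVectorizer().fit_transform(sparsity_texts)

# share of cells that are zero
n_cells = X_counts.shape[0] * X_counts.shape[1]
sparsity = 1 - X_counts.nnz / n_cells

print(X_counts.shape, X_counts.nnz)  # (4, 11) 19
print(round(sparsity, 3))            # 0.568
```

Even with four short texts more than half of the cells are zero. On a real corpus the vocabulary grows much faster than the length of any single document, which is why the vectorizers return sparse matrices.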

### Term Frequency-Inverse Document Frequency (TF-IDF)

#### TF-IDF Vectorization (e.g., `TfidfVectorizer` from `sklearn.feature_extraction.text`)
- Calculation
- Applications
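
A sketch of the calculation, assuming scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`): each raw count is multiplied by `idf = ln((1 + n_docs) / (1 + doc_freq)) + 1`, then every row is scaled to unit length. The manual version below is only there to check the formula against `TfidfVectorizer`. The variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tfidf_texts = [
    'the young dog is running with the cat',
    'running is good for your health',
    'your cat is young',
    'young young young young young cat cat cat'
]

# term frequency: raw counts per document
counts = CountVectorizer().fit_transform(tfidf_texts).toarray()

# document frequency: in how many documents each word appears
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# smoothed inverse document frequency (scikit-learn default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# tf * idf, then L2-normalize each row
tfidf_manual = counts * idf
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer().fit_transform(tfidf_texts).toarray()
print(np.allclose(tfidf_manual, tfidf_sklearn))  # True
```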

In [3]:
import pandas as pd

data = pd.read_csv('data/movie_reviews.csv')
data.head()

Unnamed: 0.1,Unnamed: 0,target,reviews,clean_reviews,target_encoded
0,0,neg,"plot : two teen couples go to a church party ,...",plot two teen couple go to a church party drin...,0
1,1,neg,the happy bastard's quick movie review \ndamn ...,the happy bastard quick movie review damn that...,0
2,2,neg,it is movies like these that make a jaded movi...,it is movie like these that make a jaded movie...,0
3,3,neg,""" quest for camelot "" is warner bros . ' firs...",quest for camelot is warner bros first feature...,0
4,4,neg,synopsis : a mentally unstable man undergoing ...,synopsis a mentally unstable man undergoing ps...,0


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

In [5]:
# applying to column
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(data.clean_reviews)
X_tfidf

<2000x41596 sparse matrix of type '<class 'numpy.float64'>'
	with 643210 stored elements in Compressed Sparse Row format>

In [1]:
# using pipeline
from sklearn.pipeline import Pipeline, make_pipeline

# Pipeline lets you name each step; GridSearchCV parameters use those names (e.g. 'tfidf__min_df')
pipeline_tfidf_nb = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

# make_pipeline is quicker to write but names the steps automatically
# (lowercased class names, e.g. 'tfidfvectorizer', 'multinomialnb'):
# pipeline_tfidf_nb = make_pipeline(
#     TfidfVectorizer(),
#     MultinomialNB()
# )

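In [None]:
# cross-validate
from sklearn.model_selection import cross_validate

# pass the raw text column: the pipeline vectorizes inside each fold
cv_results = cross_validate(pipeline_tfidf_nb, data.clean_reviews, data.target_encoded, cv=5, scoring='accuracy')
cv_results['test_score'].mean()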

### Word Embeddings

#### Word2Vec (e.g., `Word2Vec` from `gensim.models`)
- CBOW and Skip-Gram

#### GloVe (Global Vectors for Word Representation)

#### FastText
## Text Classification
### Naive Bayes Classifier
- Multinomial Naive Bayes (e.g., `MultinomialNB` from `sklearn.naive_bayes`)
### Support Vector Machines (SVM)
- Linear SVM for Text Classification (e.g., `LinearSVC` from `sklearn.svm`)
### Recurrent Neural Networks (RNN)
- Long Short-Term Memory Networks (LSTM) (e.g., `LSTM` from `keras.layers`)
- Gated Recurrent Units (GRU)
### Transformer Models
- Encoder-Decoder Architecture
- BERT (e.g., `transformers.BertModel` from `transformers` library)
- GPT (e.g., `transformers.GPT2Model` from `transformers` library)

## Text Generation
### Recurrent Neural Networks (RNN)
- Character-Level RNNs
### Transformer Models
- GPT-3, GPT-4 (e.g., `OpenAI API`)
### Retrieval-Augmented Generation (RAG)
- Combining Retrieval and Generation (e.g., `transformers.RagTokenizer`, `transformers.RagModel` from `transformers` library)

# Time Series Analysis
- Time Series Decomposition
  - Additive and Multiplicative Models
  - Trend, Seasonality, and Residuals
- ARIMA Models
  - Autoregressive (AR)
  - Integrated (I)
  - Moving Average (MA)
  - ARIMA Model Building
- Exponential Smoothing
  - Simple Exponential Smoothing
  - Holt’s Linear Trend Model
  - Holt-Winters Seasonal Model
- Long Short-Term Memory (LSTM) for Time Series
  - Sequence Prediction
  - Handling Long Sequences

# Anomaly Detection
- Techniques and Algorithms
  - Statistical Methods
    - Z-Score (e.g., `scipy.stats.zscore`)
    - Grubbs' Test
  - Proximity-Based Methods
    - k-Nearest Neighbors (k-NN) (e.g., `LocalOutlierFactor` from `sklearn.neighbors`)
    - DBSCAN (e.g., `DBSCAN` from `sklearn.cluster`)
  - Clustering-Based Methods
    - k-Means Clustering (e.g., `KMeans` from `sklearn.cluster`)
  - Machine Learning-Based Methods
    - Isolation Forest (e.g., `IsolationForest` from `sklearn.ensemble`)
    - One-Class SVM (e.g., `OneClassSVM` from `sklearn.svm`)
    - Autoencoders (e.g., using `tensorflow.keras` or `pytorch`)

# Model Deployment
- Saving and Loading Models
  - Using Pickle (e.g., `pickle.dump`, `pickle.load`)
  - Using Joblib (e.g., `joblib.dump`, `joblib.load`)
- Model Serving
  - REST APIs (Flask, FastAPI)
    - Building API Endpoints (e.g., `flask.Flask`, `fastapi.FastAPI`)
    - Handling Requests and Responses
  - Web Services (Django, Flask)
    - Integrating Machine Learning Models
    - Handling User Inputs (e.g., `django.http`, `django.views`)
  - Cloud Services (AWS SageMaker, Google AI Platform)
    - Deploying Models
    - Monitoring and Scaling
- Monitoring and Maintenance
  - Model Performance Monitoring
    - Tracking Metrics Over Time
    - Setting Alerts for Degradation
  - A/B Testing
    - Designing Experiments
    - Analyzing Results
  - Model Retraining
    - Triggering Retraining
    - Automating Pipelines

# Tools and Libraries
- Python Libraries
  - NumPy (e.g., `numpy.array`, `numpy.linalg`)
  - pandas (e.g., `pandas.DataFrame`, `pandas.Series`)
  - scikit-learn (e.g., `sklearn.preprocessing`, `sklearn.model_selection`)
  - TensorFlow (e.g., `tensorflow.keras`, `tensorflow.data`)
  - Keras (e.g., `keras.models`, `keras.layers`)
  - PyTorch (e.g., `torch.Tensor`, `torch.nn`)
  - SciPy (e.g., `scipy.stats`, `scipy.optimize`)
- Data Visualization Libraries
  - Matplotlib (e.g., `matplotlib.pyplot.plot`, `matplotlib.pyplot.show`)
  - Seaborn (e.g., `seaborn.scatterplot`, `seaborn.heatmap`)
  - Plotly (e.g., `plotly.graph_objs`, `plotly.express`)
  - Bokeh (e.g., `bokeh.plotting.figure`, `bokeh.io.show`)
  - Altair (e.g., `alt.Chart`, `alt.data_transformers`)
- Tools for Model Deployment
  - Flask (e.g., `flask.Flask`, `flask.request`)
  - Django (e.g., `django.http`, `django.views`)
  - FastAPI (e.g., `fastapi.FastAPI`, `fastapi.Request`)
  - Docker (e.g., Dockerfiles, `docker-compose.yml`)
  - Kubernetes (e.g., Pods, Deployments, Services)

# Ethical and Responsible AI
- Bias and Fairness
  - Identifying and Mitigating Bias
    - Data Bias Detection (e.g., `MetricFrame` from `fairlearn.metrics`)
    - Algorithmic Fairness (e.g., Fairlearn toolkit)
  - Fairness Metrics
    - Demographic Parity
    - Equalized Odds
- Explainability and Interpretability
  - LIME (Local Interpretable Model-agnostic Explanations)
    - Using LIME (e.g., `lime.lime_tabular`)
  - SHAP (SHapley Additive exPlanations)
    - Using SHAP (e.g., `shap.TreeExplainer`, `shap.KernelExplainer`)
  - Model-Specific Methods
    - Feature Importance in Trees (e.g., `feature_importances_` in `sklearn.ensemble` models)
- Privacy and Security
  - Differential Privacy
    - Adding Noise to Data
    - Privacy-Preserving Mechanisms
  - Federated Learning
    - Training Across Multiple Devices
    - Aggregating Results Securely
  - Secure Multi-Party Computation
    - Techniques and Protocols
- Ethical Considerations in AI
  - Ethical Guidelines (e.g., IEEE, ACM)
  - Responsible AI Practices
    - Transparency
    - Accountability
  - Case Studies and Best Practices
    - Real-World Examples
    - Lessons Learned

In [6]:
import pandas as pd
import scipy
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt

# Importing

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
scaler = MinMaxScaler()

In [None]:
# equivalent alternative: import the submodule and access the class through it
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()

# Data Pre-processing

resources:
1. https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/?ref=header_search
2. https://www.geeksforgeeks.org/ml-feature-scaling-part-2/

In [38]:
# Load dataset
# other loaders: load_diabetes, load_digits, load_breast_cancer, load_linnerud, load_sample_image, load_sample_images, load_wine
# load_boston was removed in scikit-learn 1.2; replacements:
#   fetch_california_housing()
#   fetch_openml(name="house_prices", as_frame=True)

from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()

df = pd.DataFrame(data=data.data, columns=data.feature_names)  # convert to a DataFrame
df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


### Duplicate values

In [41]:
# Find duplicates
# df.duplicated() returns a boolean Series: True for rows that repeat an earlier row

# sum the booleans to count the duplicated rows
df.duplicated().sum()

0

In [42]:
# Remove
# df.drop_duplicates()

df = df.drop_duplicates()
df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


### Missing values 

#### Identifying missing values

In [43]:
# df.isnull()

df.isnull().sum()

MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
dtype: int64

#### Dealing with missing values

##### remove rows

In [None]:
# remove rows with missing values 
df.dropna()

##### drop the column entirely

In [None]:
# df.drop(columns='COLUMN_NAME', inplace=True)
# drops the column in place; omit inplace=True to get a new DataFrame back instead

##### impute

In [None]:
# impute missing values with the mean (use df.median() or df.mode().iloc[0] for median or mode)
df.fillna(df.mean())

In [None]:
# impute with another value
# df['COLUMN_NAME'] = df['COLUMN_NAME'].fillna('NEW_VALUE')

In [45]:
# use imputers
# SimpleImputer
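
A minimal sketch of `SimpleImputer`, assuming a small made-up DataFrame (`df_demo` and its columns are hypothetical). The imputer learns one fill value per column in `fit` and applies it in `transform`, so the same values can be reused on new data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical data with one missing value per column
df_demo = pd.DataFrame({
    "age": [25.0, np.nan, 35.0],
    "income": [50.0, 60.0, np.nan],
})

# strategy can be "mean", "median", "most_frequent" or "constant"
imputer = SimpleImputer(strategy="mean")

# fit_transform returns a NumPy array, so rebuild the DataFrame
df_imputed = pd.DataFrame(imputer.fit_transform(df_demo), columns=df_demo.columns)

print(df_imputed)
# age NaN -> 30.0 (mean of 25 and 35), income NaN -> 55.0 (mean of 50 and 60)
```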

2. Encode categorical variables
    - identify: `categorical_cols = df.select_dtypes(include=['object']).columns`
    - choose encoding method
        - one-hot: `pd.get_dummies(df, columns=categorical_cols)`
        - label:
            ```python
            from sklearn.preprocessing import LabelEncoder
            label_encoder = LabelEncoder()
            for col in categorical_cols:
                df[col] = label_encoder.fit_transform(df[col])
            ```
3. Feature scaling
    - import: `from sklearn.preprocessing import StandardScaler`
    - choose scaling method
        - StandardScaler:
            ```python
            scaler = StandardScaler()
            df_scaled = scaler.fit_transform(df)
            ```
        - MinMaxScaler:
            ```python
            from sklearn.preprocessing import MinMaxScaler
            scaler = MinMaxScaler()
            df_scaled = scaler.fit_transform(df)
            ```

# One Full Preprocessing Cycle

In [None]:
from sklearn.model_selection import train_test_split
df = df.fillna(df.mean())
train, test = train_test_split(df)
train
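
A sketch of the usual next step after splitting, assuming a small synthetic DataFrame in place of the housing data (`df_demo` and its columns are hypothetical). The scaler is fitted on the training rows only and then applied to the test rows, so no information from the test set leaks into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# hypothetical numeric data
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(loc=10, scale=3, size=(100, 2)), columns=["a", "b"])

train_demo, test_demo = train_test_split(df_demo, test_size=0.25, random_state=0)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_demo)  # learn mean/std from train only
test_scaled = scaler.transform(test_demo)        # reuse the train statistics

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(test_scaled.mean(axis=0))   # close to 0, but not exactly
```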