# Machine Learning

## Introduction to Machine Learning
- Definition and Overview
  - What is Machine Learning?
  - History and Evolution
  - Key Concepts and Terminology
- Types of Machine Learning
  - Supervised Learning
    - Definition and Examples
    - Use Cases
    - Algorithms
  - Unsupervised Learning
    - Definition and Examples
    - Use Cases
    - Algorithms
  - Semi-supervised Learning
    - Definition and Examples
    - Use Cases
    - Algorithms
  - Reinforcement Learning
    - Definition and Examples
    - Use Cases
    - Algorithms
- Applications of Machine Learning
  - Healthcare
  - Finance
  - Retail
  - Autonomous Vehicles
  - Natural Language Processing

## Data Preprocessing
- Data Cleaning
  - Handling Missing Values
    - Mean/Median Imputation
    - Dropping Missing Values
    - Filling with Forward/Backward Fill
  - Handling Outliers
    - Z-Score Method
    - IQR Method
    - Winsorization
- Data Transformation
  - Encoding Categorical Variables
    - One-Hot Encoding
    - Label Encoding
    - Binary Encoding
  - Feature Scaling
    - Normalization (Min-Max Scaling)
    - Standardization (Z-Score Scaling)
  - Feature Engineering
    - Creating New Features
    - Polynomial Features
    - Interaction Features
    - Log Transformations

## Exploratory Data Analysis (EDA)
- Descriptive Statistics
  - Measures of Central Tendency (Mean, Median, Mode)
  - Measures of Dispersion (Variance, Standard Deviation, Range)
  - Skewness and Kurtosis
- Data Visualization Techniques
  - Histograms
  - Box Plots
  - Scatter Plots
  - Pair Plots
  - Heatmaps
  - Violin Plots
- Identifying Patterns and Relationships
  - Correlation Analysis
    - Pearson Correlation
    - Spearman Correlation
  - Trend Analysis
  - Detecting Seasonality

## Supervised Learning
- Regression Algorithms
  - Linear Regression
    - Assumptions
    - Model Building
    - Evaluation Metrics (MSE, RMSE, R²)
  - Polynomial Regression
    - Polynomial Features
    - Overfitting and Underfitting
  - Ridge and Lasso Regression
    - Regularization Techniques
    - Hyperparameter Tuning (α)
  - Support Vector Regression
    - Kernel Trick
    - Epsilon-Insensitive Loss
- Classification Algorithms
  - Logistic Regression
    - Sigmoid Function
    - Thresholding
    - ROC Curve and AUC
  - k-Nearest Neighbors (k-NN)
    - Distance Metrics (Euclidean, Manhattan)
    - Choosing k
  - Support Vector Machines (SVM)
    - Linear SVM
    - Kernel SVM (RBF, Polynomial)
    - Hyperparameters (C, Gamma)
  - Decision Trees
    - Splitting Criteria (Gini, Entropy)
    - Pruning Techniques
  - Random Forests
    - Bagging
    - Feature Importance
  - Gradient Boosting
    - Boosting Principle
    - Variants (AdaBoost, Gradient Boosting, XGBoost, LightGBM)
  - Neural Networks
    - Perceptrons
    - Multilayer Perceptrons (MLP)
    - Activation Functions (ReLU, Sigmoid, Tanh)

## Unsupervised Learning
- Clustering Algorithms
  - k-Means Clustering
    - Choosing k (Elbow Method, Silhouette Score)
    - Initial Centroid Selection
  - Hierarchical Clustering
    - Agglomerative vs. Divisive
    - Dendrograms
  - DBSCAN
    - Density-Based Clustering
    - Parameters (Epsilon, MinPts)
- Dimensionality Reduction
  - Principal Component Analysis (PCA)
    - Eigenvalues and Eigenvectors
    - Explained Variance Ratio
  - t-Distributed Stochastic Neighbor Embedding (t-SNE)
    - Perplexity
    - Use Cases
  - Linear Discriminant Analysis (LDA)
    - Maximizing Class Separability
    - Comparison with PCA

## Semi-Supervised Learning
- Self-Training
  - Pseudo-Labeling
    - Confidence Thresholds
    - Iterative Refinement
- Co-Training
  - View Selection
    - Independent Feature Sets
- Graph-Based Semi-Supervised Learning
  - Label Propagation
  - Graph Convolutional Networks

## Reinforcement Learning
- Introduction to Reinforcement Learning
  - Basics of RL
  - Key Concepts: Agent, Environment, State, Action, Reward
- Markov Decision Processes (MDP)
  - States and Actions
  - Transition Probabilities
  - Rewards
- Q-Learning
  - Q-Table
  - Bellman Equation
  - Exploration vs. Exploitation
- Deep Q-Networks (DQN)
  - Neural Network Architecture
  - Experience Replay
  - Target Networks
- Policy Gradient Methods
  - REINFORCE Algorithm
  - Actor-Critic Methods

## Model Evaluation and Selection
- Train/Test Split
  - Holdout Validation
  - Stratified Sampling
- Cross-Validation
  - k-Fold Cross-Validation
    - Advantages and Disadvantages
  - Leave-One-Out Cross-Validation
    - Use Cases
- Evaluation Metrics for Regression
  - Mean Squared Error (MSE)
  - Root Mean Squared Error (RMSE)
  - Mean Absolute Error (MAE)
  - R² Score
- Evaluation Metrics for Classification
  - Accuracy
  - Precision
  - Recall
  - F1 Score
  - ROC Curve and AUC
  - Confusion Matrix
- Overfitting and Underfitting
  - Bias-Variance Tradeoff
  - Techniques to Mitigate Overfitting (Regularization, Pruning, Early Stopping)
- Model Selection and Hyperparameter Tuning
  - Grid Search
  - Random Search
  - Bayesian Optimization

## Ensemble Learning
- Bagging
  - Bootstrap Aggregating
  - Random Forests
    - Out-of-Bag Error
- Boosting
  - AdaBoost
    - Weight Updates
  - Gradient Boosting
    - Gradient Descent
    - Learning Rate
  - XGBoost
    - Regularization Parameters
  - LightGBM
    - Leaf-wise Growth
- Stacking
  - Base Models
  - Meta-Model
- Voting Classifiers
  - Hard Voting
  - Soft Voting

## Neural Networks and Deep Learning
- Basics of Neural Networks
  - Perceptrons
  - Multilayer Perceptrons (MLP)
    - Forward Propagation
    - Backpropagation
- Activation Functions
  - Sigmoid
  - Tanh
  - ReLU
  - Leaky ReLU
- Training Neural Networks
  - Gradient Descent
    - Batch Gradient Descent
    - Stochastic Gradient Descent
    - Mini-Batch Gradient Descent
  - Optimizers
    - Adam
    - RMSprop
    - Adagrad
- Convolutional Neural Networks (CNNs)
  - Convolutional Layers
  - Pooling Layers
  - Fully Connected Layers
  - Dropout Regularization
- Recurrent Neural Networks (RNNs)
  - Basic RNN
  - Long Short-Term Memory (LSTM) Networks
  - Gated Recurrent Unit (GRU)
- Autoencoders
  - Basic Autoencoder
  - Denoising Autoencoder
  - Variational Autoencoder (VAE)
- Generative Adversarial Networks (GANs)
  - Generator and Discriminator
  - Training GANs
  - Applications (Image Generation, Style Transfer)

## Natural Language Processing (NLP)
- Text Preprocessing
  - Tokenization
    - Word Tokenization
    - Sentence Tokenization
  - Lemmatization
  - Stemming
  - Stop Words Removal
  - N-grams
- Bag-of-Words (BoW)
  - Vector Representation
  - Sparsity Issues
- Term Frequency-Inverse Document Frequency (TF-IDF)
  - Calculation
  - Applications
- Word Embeddings
  - Word2Vec
    - CBOW and Skip-Gram
  - GloVe
  - FastText
- Sequence Models
  - Recurrent Neural Networks (RNN)
  - Long Short-Term Memory (LSTM)
  - Gated Recurrent Unit (GRU)
- Attention Mechanisms and Transformers
  - Attention Mechanisms
    - Self-Attention
    - Multi-Head Attention
  - Transformers
    - Encoder-Decoder Architecture
    - BERT
    - GPT
- Retrieval-Augmented Generation (RAG)
  - Overview
  - Applications

## Time Series Analysis
- Time Series Decomposition
  - Additive and Multiplicative Models
  - Trend, Seasonality, and Residuals
- ARIMA Models
  - Autoregressive (AR)
  - Integrated (I)
  - Moving Average (MA)
  - ARIMA Model Building
- Exponential Smoothing
  - Simple Exponential Smoothing
  - Holt’s Linear Trend Model
  - Holt-Winters Seasonal Model
- Long Short-Term Memory (LSTM) for Time Series
  - Sequence Prediction
  - Handling Long Sequences

## Anomaly Detection
- Statistical Methods
  - Z-Score
  - IQR
  - Moving Average
- Machine Learning Methods
  - Isolation Forest
  - One-Class SVM
  - k-Means Clustering
- Deep Learning Methods
  - Autoencoders
  - Generative Adversarial Networks (GANs)

## Model Deployment
- Saving and Loading Models
  - Using Pickle
  - Using Joblib
- Model Serving
  - REST APIs (Flask, FastAPI)
  - Web Services (Django, Flask)
  - Cloud Services (AWS SageMaker, Google AI Platform)
- Monitoring and Maintenance
  - Model Performance Monitoring
  - A/B Testing
  - Model Retraining

## Tools and Libraries
- Python Libraries
  - NumPy
  - pandas
  - scikit-learn
  - TensorFlow
  - Keras
  - PyTorch
  - SciPy
- Data Visualization Libraries
  - Matplotlib
  - Seaborn
  - Plotly
  - Bokeh
  - Altair
- Tools for Model Deployment
  - Flask
  - Django
  - FastAPI
  - Docker
  - Kubernetes

## Ethical and Responsible AI
- Bias and Fairness
  - Identifying and Mitigating Bias
  - Fairness Metrics
- Explainability and Interpretability
  - LIME (Local Interpretable Model-agnostic Explanations)
  - SHAP (SHapley Additive exPlanations)
  - Model-Specific Methods (e.g., Feature Importance in Trees)
- Privacy and Security
  - Differential Privacy
  - Federated Learning
  - Secure Multi-Party Computation
- Ethical Considerations in AI
  - Ethical Guidelines (e.g., IEEE, ACM)
  - Responsible AI Practices
  - Case Studies and Best Practices


# Machine Learning

## Introduction to Machine Learning
- Definition and Overview
  - What is Machine Learning?
  - History and Evolution
  - Key Concepts and Terminology
- Types of Machine Learning
  - Supervised Learning
    - Definition and Examples
    - Use Cases
    - Algorithms (e.g., `LinearRegression` from `sklearn.linear_model`)
  - Unsupervised Learning
    - Definition and Examples
    - Use Cases
    - Algorithms (e.g., `KMeans` from `sklearn.cluster`)
  - Semi-supervised Learning
    - Definition and Examples
    - Use Cases
    - Algorithms (e.g., `SelfTrainingClassifier` from `sklearn.semi_supervised`)
  - Reinforcement Learning
    - Definition and Examples
    - Use Cases
    - Algorithms (e.g., Q-Learning, Deep Q-Networks with `tensorflow` or `pytorch`)
- Applications of Machine Learning
  - Healthcare
  - Finance
  - Retail
  - Autonomous Vehicles
  - Natural Language Processing

## Data Preprocessing
- Data Cleaning
  - Handling Missing Values
    - Mean/Median Imputation (e.g., `SimpleImputer` from `sklearn.impute`)
    - Dropping Missing Values (e.g., `dropna()` from `pandas`)
    - Filling with Forward/Backward Fill (e.g., `ffill()` / `bfill()` from `pandas`; `fillna(method='ffill')` is deprecated in pandas 2.x)
  - Handling Outliers
    - Z-Score Method (e.g., `scipy.stats.zscore`)
    - IQR Method (e.g., using `quantile` from `pandas`)
    - Winsorization (e.g., `winsorize` from `scipy.stats.mstats`)
- Data Transformation
  - Encoding Categorical Variables
    - One-Hot Encoding (e.g., `OneHotEncoder` from `sklearn.preprocessing`)
    - Label Encoding (e.g., `LabelEncoder` from `sklearn.preprocessing`)
    - Binary Encoding (e.g., `BinaryEncoder` from `category_encoders`)
  - Feature Scaling
    - Normalization (Min-Max Scaling) (e.g., `MinMaxScaler` from `sklearn.preprocessing`)
    - Standardization (Z-Score Scaling) (e.g., `StandardScaler` from `sklearn.preprocessing`)
  - Feature Engineering
    - Creating New Features (e.g., using `pandas`)
    - Polynomial Features (e.g., `PolynomialFeatures` from `sklearn.preprocessing`)
    - Interaction Features (e.g., using `pandas` and custom functions)
    - Log Transformations (e.g., `numpy.log`)

## Exploratory Data Analysis (EDA)
- Descriptive Statistics
  - Measures of Central Tendency (Mean, Median, Mode) (e.g., `mean`, `median` from `numpy` or `pandas`; `mode` from `pandas` or `scipy.stats`)
  - Measures of Dispersion (Variance, Standard Deviation, Range) (e.g., `var`, `std` from `numpy` or `pandas`)
  - Skewness and Kurtosis (e.g., `skew`, `kurtosis` from `scipy.stats`)
- Data Visualization Techniques
  - Histograms (e.g., `hist` from `matplotlib.pyplot` or `histplot` from `seaborn`)
  - Box Plots (e.g., `boxplot` from `matplotlib.pyplot` or `seaborn`)
  - Scatter Plots (e.g., `scatter` from `matplotlib.pyplot` or `scatterplot` from `seaborn`)
  - Pair Plots (e.g., `pairplot` from `seaborn`)
  - Heatmaps (e.g., `heatmap` from `seaborn`)
  - Violin Plots (e.g., `violinplot` from `seaborn`)
- Identifying Patterns and Relationships
  - Correlation Analysis
    - Pearson Correlation (e.g., `corr` from `pandas`)
    - Spearman Correlation (e.g., `spearmanr` from `scipy.stats`)
  - Trend Analysis (e.g., using `pandas` time series methods)
  - Detecting Seasonality (e.g., using `statsmodels.tsa`)

## Supervised Learning
- Regression Algorithms
  - Linear Regression
    - Assumptions
    - Model Building (e.g., `LinearRegression` from `sklearn.linear_model`)
    - Evaluation Metrics (MSE, RMSE, R²) (e.g., `mean_squared_error`, `r2_score` from `sklearn.metrics`)
  - Polynomial Regression
    - Polynomial Features (e.g., `PolynomialFeatures` from `sklearn.preprocessing`)
    - Overfitting and Underfitting
  - Ridge and Lasso Regression
    - Regularization Techniques (e.g., `Ridge`, `Lasso` from `sklearn.linear_model`)
    - Hyperparameter Tuning (α) (e.g., `GridSearchCV` from `sklearn.model_selection`)
  - Support Vector Regression
    - Kernel Trick (e.g., `SVR` from `sklearn.svm`)
    - Epsilon-Insensitive Loss
- Classification Algorithms
  - Logistic Regression
    - Sigmoid Function
    - Thresholding
    - ROC Curve and AUC (e.g., `roc_curve`, `auc` from `sklearn.metrics`)
  - k-Nearest Neighbors (k-NN)
    - Distance Metrics (Euclidean, Manhattan) (e.g., `KNeighborsClassifier` from `sklearn.neighbors`)
    - Choosing k
  - Support Vector Machines (SVM)
    - Linear SVM
    - Kernel SVM (RBF, Polynomial) (e.g., `SVC` from `sklearn.svm`)
    - Hyperparameters (C, Gamma)
  - Decision Trees
    - Splitting Criteria (Gini, Entropy) (e.g., `DecisionTreeClassifier` from `sklearn.tree`)
    - Pruning Techniques
  - Random Forests
    - Bagging (e.g., `RandomForestClassifier` from `sklearn.ensemble`)
    - Feature Importance
  - Gradient Boosting
    - Boosting Principle
    - Variants (AdaBoost, Gradient Boosting, XGBoost, LightGBM) (e.g., `GradientBoostingClassifier`, `XGBClassifier`, `LGBMClassifier`)
  - Neural Networks
    - Perceptrons
    - Multilayer Perceptrons (MLP) (e.g., `MLPClassifier` from `sklearn.neural_network`)
    - Activation Functions (ReLU, Sigmoid, Tanh)

## Unsupervised Learning
- Clustering Algorithms
  - k-Means Clustering
    - Choosing k (Elbow Method, Silhouette Score)
    - Initial Centroid Selection (e.g., `KMeans` from `sklearn.cluster`)
  - Hierarchical Clustering
    - Agglomerative vs. Divisive
    - Dendrograms (e.g., `AgglomerativeClustering` from `sklearn.cluster`, `dendrogram` from `scipy.cluster.hierarchy`)
  - DBSCAN
    - Density-Based Clustering
    - Parameters (Epsilon, MinPts) (e.g., `DBSCAN` from `sklearn.cluster`)
- Dimensionality Reduction
  - Principal Component Analysis (PCA)
    - Eigenvalues and Eigenvectors
    - Explained Variance Ratio (e.g., `PCA` from `sklearn.decomposition`)
  - t-Distributed Stochastic Neighbor Embedding (t-SNE)
    - Perplexity
    - Use Cases (e.g., `TSNE` from `sklearn.manifold`)
  - Linear Discriminant Analysis (LDA)
    - Maximizing Class Separability
    - Comparison with PCA (e.g., `LinearDiscriminantAnalysis` from `sklearn.discriminant_analysis`)

## Semi-Supervised Learning
- Self-Training
  - Pseudo-Labeling
    - Confidence Thresholds
    - Iterative Refinement
- Co-Training
  - View Selection
    - Independent Feature Sets
- Graph-Based Semi-Supervised Learning
  - Label Propagation (e.g., `LabelPropagation` from `sklearn.semi_supervised`)
  - Graph Convolutional Networks

## Reinforcement Learning
- Introduction to Reinforcement Learning
  - Basics of RL
  - Key Concepts: Agent, Environment, State, Action, Reward
- Markov Decision Processes (MDP)
  - States and Actions
  - Transition Probabilities
  - Rewards
- Q-Learning
  - Q-Table
  - Bellman Equation
  - Exploration vs. Exploitation
- Deep Q-Networks (DQN)
  - Neural Network Architecture (e.g., using `tensorflow` or `pytorch`)
  - Experience Replay
  - Target Networks
- Policy Gradient Methods
  - REINFORCE Algorithm
  - Actor-Critic Methods
  
## Model Evaluation and Selection
- Train/Test Split
  - Holdout Validation
  - Stratified Sampling
- Cross-Validation
  - k-Fold Cross-Validation
    - Advantages and Disadvantages
  - Leave-One-Out Cross-Validation
    - Use Cases
- Evaluation Metrics for Regression
  - Mean Squared Error (MSE)
  - Root Mean Squared Error (RMSE)
  - Mean Absolute Error (MAE)
  - R² Score
- Evaluation Metrics for Classification
  - Accuracy
  - Precision
  - Recall
  - F1 Score
  - ROC Curve and AUC
  - Confusion Matrix
- Overfitting and Underfitting
  - Bias-Variance Tradeoff
  - Techniques to Mitigate Overfitting (Regularization, Pruning, Early Stopping)
- Model Selection and Hyperparameter Tuning
  - Grid Search
  - Random Search
  - Bayesian Optimization
  
## Ensemble Learning
- Bagging
  - Bootstrap Aggregating
  - Random Forests
    - Out-of-Bag Error
- Boosting
  - AdaBoost
    - Weight Updates
  - Gradient Boosting
    - Gradient Descent
    - Learning Rate
  - XGBoost
    - Regularization Parameters
  - LightGBM
    - Leaf-wise Growth
- Stacking
  - Base Models
  - Meta-Model
- Voting Classifiers
  - Hard Voting
  - Soft Voting
  
## Neural Networks and Deep Learning
- Basics of Neural Networks
  - Perceptrons
  - Multilayer Perceptrons (MLP)
    - Forward Propagation
    - Backpropagation
- Activation Functions
  - Sigmoid
  - Tanh
  - ReLU
  - Leaky ReLU
- Training Neural Networks
  - Gradient Descent
    - Batch Gradient Descent
    - Stochastic Gradient Descent
    - Mini-Batch Gradient Descent
  - Optimizers
    - Adam
    - RMSprop
    - Adagrad
- Convolutional Neural Networks (CNNs)
  - Convolutional Layers
  - Pooling Layers
  - Fully Connected Layers
  - Dropout Regularization
- Recurrent Neural Networks (RNNs)
  - Basic RNN
  - Long Short-Term Memory (LSTM) Networks
  - Gated Recurrent Unit (GRU)
- Autoencoders
  - Basic Autoencoder
  - Denoising Autoencoder
  - Variational Autoencoder (VAE)
- Generative Adversarial Networks (GANs)
  - Generator and Discriminator
  - Training GANs
  - Applications (Image Generation, Style Transfer)

## Natural Language Processing (NLP)
- Text Preprocessing
  - Tokenization (e.g., `word_tokenize` from `nltk`, `Tokenizer` from `keras.preprocessing.text`)
    - Word Tokenization
    - Sentence Tokenization
  - Stop Words Removal (e.g., `stopwords` from `nltk.corpus`)
  - Stemming (e.g., `PorterStemmer` from `nltk.stem`)
  - Lemmatization (e.g., `WordNetLemmatizer` from `nltk.stem`)
  - N-grams
- Feature Extraction
  - Bag-of-Words (BoW)
    - Count Vectorization (e.g., `CountVectorizer` from `sklearn.feature_extraction.text`)
      - Vector Representation
      - Sparsity Issues
  - Term Frequency-Inverse Document Frequency (TF-IDF)
    - TF-IDF Vectorization (e.g., `TfidfVectorizer` from `sklearn.feature_extraction.text`)
      - Calculation
      - Applications
  - Word Embeddings
    - Word2Vec (e.g., `Word2Vec` from `gensim.models`)
      - CBOW and Skip-Gram
    - GloVe (Global Vectors for Word Representation)
    - FastText
- Text Classification
  - Naive Bayes Classifier
    - Multinomial Naive Bayes (e.g., `MultinomialNB` from `sklearn.naive_bayes`)
  - Support Vector Machines (SVM)
    - Linear SVM for Text Classification (e.g., `LinearSVC` from `sklearn.svm`)
  - Recurrent Neural Networks (RNN)
    - Long Short-Term Memory Networks (LSTM) (e.g., `LSTM` from `keras.layers`)
    - Gated Recurrent Units (GRU)
  - Transformer Models
    - Encoder-Decoder Architecture
    - BERT (e.g., `BertModel` from `transformers`)
    - GPT (e.g., `GPT2Model` from `transformers`)
- Text Generation
  - Recurrent Neural Networks (RNN)
    - Character-Level RNNs
  - Transformer Models
    - GPT-3, GPT-4 (e.g., `OpenAI API`)
  - Retrieval-Augmented Generation (RAG)
    - Combining Retrieval and Generation (e.g., `transformers.RagTokenizer`, `transformers.RagModel` from `transformers` library)
    
## Time Series Analysis
- Time Series Decomposition
  - Additive and Multiplicative Models
  - Trend, Seasonality, and Residuals
- ARIMA Models
  - Autoregressive (AR)
  - Integrated (I)
  - Moving Average (MA)
  - ARIMA Model Building
- Exponential Smoothing
  - Simple Exponential Smoothing
  - Holt’s Linear Trend Model
  - Holt-Winters Seasonal Model
- Long Short-Term Memory (LSTM) for Time Series
  - Sequence Prediction
  - Handling Long Sequences

## Anomaly Detection
- Techniques and Algorithms
  - Statistical Methods
    - Z-Score (e.g., `scipy.stats.zscore`)
    - Grubbs' Test
  - Proximity-Based Methods
    - k-Nearest Neighbors (k-NN) (e.g., `LocalOutlierFactor` from `sklearn.neighbors`)
    - DBSCAN (e.g., `DBSCAN` from `sklearn.cluster`)
  - Clustering-Based Methods
    - k-Means Clustering (e.g., `KMeans` from `sklearn.cluster`)
  - Machine Learning-Based Methods
    - Isolation Forest (e.g., `IsolationForest` from `sklearn.ensemble`)
    - One-Class SVM (e.g., `OneClassSVM` from `sklearn.svm`)
    - Autoencoders (e.g., using `tensorflow.keras` or `pytorch`)

## Model Deployment
- Saving and Loading Models
  - Using Pickle (e.g., `pickle.dump`, `pickle.load`)
  - Using Joblib (e.g., `joblib.dump`, `joblib.load`)
- Model Serving
  - REST APIs (Flask, FastAPI)
    - Building API Endpoints (e.g., `flask.Flask`, `fastapi.FastAPI`)
    - Handling Requests and Responses
  - Web Services (Django, Flask)
    - Integrating Machine Learning Models
    - Handling User Inputs (e.g., `django.http`, `django.views`)
  - Cloud Services (AWS SageMaker, Google AI Platform)
    - Deploying Models
    - Monitoring and Scaling
- Monitoring and Maintenance
  - Model Performance Monitoring
    - Tracking Metrics Over Time
    - Setting Alerts for Degradation
  - A/B Testing
    - Designing Experiments
    - Analyzing Results
  - Model Retraining
    - Triggering Retraining
    - Automating Pipelines

## Tools and Libraries
- Python Libraries
  - NumPy (e.g., `numpy.array`, `numpy.linalg`)
  - pandas (e.g., `pandas.DataFrame`, `pandas.Series`)
  - scikit-learn (e.g., `sklearn.preprocessing`, `sklearn.model_selection`)
  - TensorFlow (e.g., `tensorflow.keras`, `tensorflow.data`)
  - Keras (e.g., `keras.models`, `keras.layers`)
  - PyTorch (e.g., `torch.Tensor`, `torch.nn`)
  - SciPy (e.g., `scipy.stats`, `scipy.optimize`)
- Data Visualization Libraries
  - Matplotlib (e.g., `matplotlib.pyplot.plot`, `matplotlib.pyplot.show`)
  - Seaborn (e.g., `seaborn.scatterplot`, `seaborn.heatmap`)
  - Plotly (e.g., `plotly.graph_objs`, `plotly.express`)
  - Bokeh (e.g., `bokeh.plotting.figure`, `bokeh.io.show`)
  - Altair (e.g., `alt.Chart`, `alt.data_transformers`)
- Tools for Model Deployment
  - Flask (e.g., `flask.Flask`, `flask.request`)
  - Django (e.g., `django.http`, `django.views`)
  - FastAPI (e.g., `fastapi.FastAPI`, `fastapi.Request`)
  - Docker (e.g., Dockerfiles, `docker-compose.yml`)
  - Kubernetes (e.g., Pods, Deployments, Services)

## Ethical and Responsible AI
- Bias and Fairness
  - Identifying and Mitigating Bias
    - Data Bias Detection (e.g., per-group `sklearn.metrics` scores via `MetricFrame` from `fairlearn.metrics`)
    - Algorithmic Fairness (e.g., Fairlearn toolkit)
  - Fairness Metrics
    - Demographic Parity
    - Equalized Odds
- Explainability and Interpretability
  - LIME (Local Interpretable Model-agnostic Explanations)
    - Using LIME (e.g., `lime.lime_tabular`)
  - SHAP (SHapley Additive exPlanations)
    - Using SHAP (e.g., `shap.TreeExplainer`, `shap.KernelExplainer`)
  - Model-Specific Methods
    - Feature Importance in Trees (e.g., `feature_importances_` in `sklearn.ensemble` models)
- Privacy and Security
  - Differential Privacy
    - Adding Noise to Data
    - Privacy-Preserving Mechanisms
  - Federated Learning
    - Training Across Multiple Devices
    - Aggregating Results Securely
  - Secure Multi-Party Computation
    - Techniques and Protocols
- Ethical Considerations in AI
  - Ethical Guidelines (e.g., IEEE, ACM)
  - Responsible AI Practices
    - Transparency
    - Accountability
  - Case Studies and Best Practices
    - Real-World Examples
    - Lessons Learned


In [6]:
import pandas as pd
import scipy
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt

# Importing

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
scaler = MinMaxScaler()

In [None]:
# alternative: import the submodule and reach the class through it (same MinMaxScaler class, different import path)
from sklearn import preprocessing
scaler = preprocessing.MinMaxScaler()
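
Both import styles above resolve to the same class, so the choice is only about the name you type. A minimal sketch, assuming a made-up two-column array (the values are illustrative and not taken from any dataset in these notes):

```python
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler

# Both names point at the same class object.
same_class = preprocessing.MinMaxScaler is MinMaxScaler

# Toy data (hypothetical values): two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

scaler = MinMaxScaler()             # default feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)  # per column: (x - min) / (max - min)

print(same_class)
print(X_scaled)
```

After scaling, each column runs from 0 to 1 regardless of its original units.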

# Data Pre-processing

resources:
1. https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/?ref=header_search
2. https://www.geeksforgeeks.org/ml-feature-scaling-part-2/
3. http://localhost:8889/notebooks/tjyana/05-ML/02-Prepare-the-dataset/data-preprocessing-workflow/Preprocessing-Workflow.ipynb

In [38]:
# Load dataset
# other loaders: load_diabetes, load_digits, load_breast_cancer, load_linnerud, load_wine, load_sample_image, load_sample_images

# load_boston was removed in scikit-learn 1.2; replacements:

# from sklearn.datasets import fetch_california_housing
# housing = fetch_california_housing()

# from sklearn.datasets import fetch_openml
# housing = fetch_openml(name="house_prices", as_frame=True)


from sklearn.datasets import fetch_california_housing
data = fetch_california_housing()

df = pd.DataFrame(data = data.data, columns = data.feature_names) # turn to df
df

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 |


### Duplicate values

In [41]:
# Find duplicates
# df.duplicated() returns a boolean Series: True for each row that repeats an earlier row

# sum the Series to count the duplicate rows
df.duplicated().sum()

0
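
The housing frame reports zero duplicates, so the calls above have nothing to show. A small sketch on a made-up frame (hypothetical values) illustrates what `duplicated()` and `drop_duplicates()` do when a repeated row is present:

```python
import pandas as pd

# Toy frame (hypothetical values) where the third row repeats the first.
toy = pd.DataFrame({"a": [1, 2, 1], "b": ["x", "y", "x"]})

dup_mask = toy.duplicated()      # True for every repeat of an earlier row
n_dups = int(dup_mask.sum())     # number of duplicate rows
deduped = toy.drop_duplicates()  # keeps the first occurrence by default

print(dup_mask.tolist())
print(n_dups, len(deduped))
```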

In [42]:
# Remove
# df.drop_duplicates()

df = df.drop_duplicates()
df

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 |


### Missing values 

#### Identifying missing values

In [43]:
# df.isnull()

df.isnull().sum()

MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
dtype: int64

#### Dealing with missing values

##### remove rows

In [None]:
# remove rows with missing values (returns a new DataFrame; assign it back to keep the result)
df.dropna()

##### drop the column entirely

In [None]:
# drop a column entirely (valid pandas; 'COLUMN_NAME' is a placeholder)
# df.drop(columns='COLUMN_NAME', inplace=True)

##### impute

In [None]:
# impute missing values with the column mean (or df.median(), or df.mode().iloc[0] for the mode)
df.fillna(df.mean())

In [None]:
# impute with another value ('COLUMN_NAME' and 'NEW_VALUE' are placeholders)
# df['COLUMN_NAME'] = df['COLUMN_NAME'].fillna('NEW_VALUE')

In [45]:
# use imputers
# SimpleImputer
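
A minimal sketch of the `SimpleImputer` route, assuming a made-up frame with one gap per column (the column names and values are hypothetical). It does the same job as `fillna(df.mean())`, but the fitted imputer remembers the column means so they can be reused on new data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values) with one missing entry per column.
toy = pd.DataFrame({"age": [20.0, np.nan, 40.0],
                    "income": [1.0, 2.0, np.nan]})

# Other strategies: "median", "most_frequent", "constant".
imputer = SimpleImputer(strategy="mean")

# fit_transform returns a NumPy array, so rebuild the DataFrame.
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(imputer.statistics_)  # per-column means learned during fit
print(filled)
```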

2. Encode categorical variables
    - identify: `categorical_cols = df.select_dtypes(include=['object']).columns`
    - choose encoding method
        - one-hot: `pd.get_dummies(df, columns=categorical_cols)`
        - label

            ```
            from sklearn.preprocessing import LabelEncoder
            label_encoder = LabelEncoder()
            for col in categorical_cols:
                df[col] = label_encoder.fit_transform(df[col])
            ```
3. Feature scaling
    - import: `from sklearn.preprocessing import StandardScaler`
    - choose scaling method
        - StandardScaler

            ```
            scaler = StandardScaler()
            df_scaled = scaler.fit_transform(df)
            ```
        - MinMaxScaler

            ```
            from sklearn.preprocessing import MinMaxScaler
            scaler = MinMaxScaler()
            df_scaled = scaler.fit_transform(df)
            ```
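
The encoding and scaling steps listed above can be tried on a small frame. This is a sketch under assumptions: the `city` and `rooms` columns and their values are made up, and categorical columns are picked with `exclude="number"` instead of `include=['object']` so the selection does not depend on how pandas stores strings.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame (hypothetical values): one categorical and one numeric column.
toy = pd.DataFrame({"city": ["tokyo", "osaka", "tokyo", "kyoto"],
                    "rooms": [1.0, 2.0, 3.0, 4.0]})

# Identify categorical columns as everything that is not numeric.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# One-hot encode: each category becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=categorical_cols)

# Standardize the numeric column to zero mean and unit variance.
scaler = StandardScaler()
encoded[["rooms"]] = scaler.fit_transform(encoded[["rooms"]])

print(encoded)
```

Only the numeric column is scaled here; the indicator columns are left as they are.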

# HEY LET'S TRY DOING ONE FULL PREPROCESS CYCLE BEFORE TAKING NOTES

# YOU'RE TAKING TOO LONG

In [None]:
from sklearn.model_selection import train_test_split

# fill any missing values with the column mean
df = df.fillna(df.mean())
# default split: 75% train, 25% test, shuffled
train, test = train_test_split(df)
train
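
One full preprocessing cycle could look like the sketch below. It is an illustration under assumptions, not the notebook's own workflow: the data is a synthetic stand-in for `df` (random values with a few gaps punched in), and the split comes before imputing and scaling so that the means and standard deviations are learned from the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df (hypothetical values), with every 10th f1 missing.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
toy.iloc[::10, 0] = np.nan

# 1. Drop exact duplicate rows.
toy = toy.drop_duplicates()

# 2. Split first, so test rows never influence the fitted statistics.
train, test = train_test_split(toy, test_size=0.25, random_state=42)

# 3. Impute, then scale; both steps are fitted on the training rows only.
prep = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
train_prepared = prep.fit_transform(train)
test_prepared = prep.transform(test)

print(train_prepared.shape, test_prepared.shape)
```

Calling `transform` (not `fit_transform`) on the test rows reuses the training statistics, which keeps the evaluation honest.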