## Module 1

### What is Machine Learning?

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
To have a learning problem, we must identify:
- class of tasks T
- performance measure P
- source of experience E

Traditional programming vs. machine learning:
- Traditional programming: (data) + (program) = (output)
- Machine Learning: (data) + (output) = (program)
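
The contrast above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the spam feature (number of links), the hand-picked threshold, the toy data, and the helper names `is_spam_rule` and `learn_threshold` are all made up for this example.

```python
from typing import List


# Traditional programming: a human writes the rule (the program) by hand.
def is_spam_rule(num_links: int) -> bool:
    # Hand-picked threshold; an assumption for illustration only.
    return num_links > 3


# Machine learning: the rule (here, a single threshold) is derived
# from data plus the desired outputs.
def learn_threshold(samples: List[int], labels: List[bool]) -> float:
    """Return the threshold t for which the rule `x > t` is most accurate on the training data."""
    values = sorted(set(samples))
    # Candidate thresholds: below the minimum, midpoints between
    # consecutive distinct values, and above the maximum.
    thresholds = [values[0] - 1.0]
    thresholds += [(a + b) / 2 for a, b in zip(values, values[1:])]
    thresholds.append(values[-1] + 1.0)

    def accuracy(t: float) -> float:
        correct = sum((x > t) == y for x, y in zip(samples, labels))
        return correct / len(samples)

    return max(thresholds, key=accuracy)


# Toy data (made up): number of links per email and whether it was spam.
links = [0, 1, 1, 2, 5, 6, 8, 9]
spam = [False, False, False, False, True, True, True, True]

threshold = learn_threshold(links, spam)  # data + output -> program


def learned_rule(num_links: int) -> bool:
    return num_links > threshold
```

In the first function the programmer supplies the rule; in the second, the rule is whatever threshold best reproduces the labelled examples.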

#### Table with learning tasks, performance measures and experience sources

| Task | Performance Measure | Experience Source |
| --- | --- | --- |
| Email spam filter | Accuracy of the filter | User marks emails as spam/not spam |
| Handwritten digit recognition | Accuracy of the classifier | User provides examples of digits |
| Self-driving car | Safety and efficiency of the car | User drives the car |
| Playing checkers | % of games won against opponent | Games played against itself |

#### When do we use / not use Machine Learning?
Used when:
- the problem requires lots of hand-tuning, long lists of rules, or rules that are hard to define
- the environment is complex or fluctuating
- expert knowledge does not exist or is difficult to obtain
- models are based on huge amounts of data or must be customized to each individual

Not used when:
- simple, static environment, well-defined rules
- no uncertainty in the environment
- expert knowledge is available

### Machine Learning Process

| Step | Description |
| --- | --- |
| 1. Define the Problem     | Clearly define the problem statement, including the goal and the target variable(s).<br> Identify the available resources, constraints, and relevant stakeholders.<br> Understand the domain knowledge and business context to ensure the problem's relevance. |
| 2. Data Collection        | Determine the data requirements based on the problem definition.<br> Identify potential data sources and acquire the necessary datasets.<br> Ensure data quality by performing data validation, cleaning, and handling missing values or outliers. |
| 3. Data Exploration       | Perform statistical analysis, such as summary statistics and data distributions.<br> Visualize the data through plots, histograms, scatterplots, or heatmaps.<br> Identify correlations, patterns, and outliers within the dataset.<br> Conduct feature correlation analysis to understand relationships between variables. |
| 4. Feature Engineering    | Select relevant features based on domain knowledge and exploration.<br> Handle categorical variables through techniques like one-hot encoding or ordinal encoding.<br> Scale numerical features to a common range or apply normalization techniques.<br> Create new features by transforming or combining existing ones (e.g., feature interactions, polynomial features). |
| 5. Model Selection        | Identify the problem type (classification, regression, clustering, etc.).<br> Consider the characteristics of the dataset (e.g., size, dimensionality) and the assumptions of different algorithms.<br> Evaluate various algorithms and choose the one that best suits the problem and data. |
| 6. Model Training         | Split the data into training, validation, and testing sets (e.g., using random sampling or time-based splitting).<br> Apply the chosen algorithm to the training data and optimize its hyperparameters using the validation set.<br> Evaluate the model's performance on the testing set using appropriate metrics, keeping the testing set out of hyperparameter tuning so the estimate is not optimistic.<br> Repeat the training process with different algorithms or parameter settings if necessary. |
| 7. Model Evaluation       | Calculate evaluation metrics such as accuracy, precision, recall, F1 score, or mean squared error.<br> Perform cross-validation or holdout validation to estimate the model's performance on unseen data.<br> Analyze the model's strengths, weaknesses, and potential biases.<br> Consider business requirements and domain-specific metrics for a comprehensive evaluation. |
| 8. Model Optimization     | Fine-tune the model's hyperparameters through techniques like grid search, random search, or Bayesian optimization.<br> Regularize the model to prevent overfitting using techniques like L1/L2 regularization or dropout.<br> Explore ensemble methods, such as bagging or boosting, to improve model performance.<br> Use feature selection techniques to remove irrelevant or redundant features. |
| 9. Model Deployment       | Prepare the model for deployment by saving its trained parameters and associated preprocessing steps.<br> Integrate the model into an application, system, or cloud infrastructure.<br> Design and implement an API for making predictions using the deployed model.<br> Ensure the model's scalability, robustness, and security in a production environment. |
| 10. Monitoring and Maintenance | Continuously monitor the model's performance in real-world scenarios.<br> Collect feedback and track performance metrics to detect any degradation or concept drift.<br> Retrain the model periodically with new data to keep it up-to-date and maintain its accuracy.<br> Conduct regular model audits and updates as needed. |
| 11. Iteration and Improvement | Regularly revisit and refine the model as new insights are gained, data quality improves, or new techniques emerge.<br> Incorporate feedback from stakeholders and address any limitations or shortcomings.<br> Continuously experiment with new algorithms or approaches to improve the model's performance and adapt to evolving requirements. |
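
As a rough illustration of steps 6 and 7, the sketch below splits a toy dataset, trains a very simple classifier, and computes a few of the metrics named in the table. Everything here is an assumption made for the example: the data is invented, the nearest-centroid classifier is just one convenient choice, and the helper names are hypothetical. In practice a library such as scikit-learn would normally be used.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[float, int]  # (feature value, label in {0, 1})


def train_test_split(
    data: List[Sample], test_fraction: float, seed: int
) -> Tuple[List[Sample], List[Sample]]:
    """Shuffle a copy of the data and hold out a fraction of it for testing."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_centroids(train: List[Sample]) -> Tuple[float, float]:
    """Training step: store the mean feature value of each class."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    return sum(zeros) / len(zeros), sum(ones) / len(ones)


def predict(centroids: Tuple[float, float], x: float) -> int:
    """Predict the class whose centroid is closest to x."""
    c0, c1 = centroids
    return 1 if abs(x - c1) < abs(x - c0) else 0


def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 score for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Undefined ratios (no predicted or no actual positives) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy, made-up dataset: class 0 has small values, class 1 has large values.
data = [(float(i), 0) for i in range(10)] + [(float(i), 1) for i in range(15, 25)]

train, test = train_test_split(data, test_fraction=0.25, seed=0)
centroids = fit_centroids(train)
y_true = [y for _, y in test]
y_pred = [predict(centroids, x) for x, _ in test]
metrics = binary_metrics(y_true, y_pred)
```

Because the two toy classes are far apart, the held-out accuracy is perfect here; real data is rarely this clean, which is why the later steps (evaluation, optimization, monitoring) exist.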

#### Types of learning:
- Supervised (inductive) learning
    - given training data, desired outputs (labels)
    - learn a function that maps inputs to outputs
    - types:
        - classification (predict class or category, discrete value)
            - binary classification (2 classes)
            - multi-class classification (more than 2 classes)
        - regression (predict continuous value)
- Unsupervised learning
    - given training data, no desired outputs
    - learn a function that describes hidden structure from unlabeled data
- Semi-supervised learning
    - given training data, some desired outputs
    - learn a function that maps inputs to outputs
- Reinforcement learning
    - given rewards from a sequence of actions
    - learn a policy (a function from states to actions) that maximizes the cumulative reward
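
To make the supervised/unsupervised distinction concrete, here is a minimal sketch on the same one-dimensional toy data. The data, the label names, and the helper functions are assumptions for illustration; 1-nearest-neighbour and 2-means are just two simple representatives of each family.

```python
from typing import List, Tuple

# Toy, made-up data. Only the supervised learner gets to see the labels.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
labels = ["low", "low", "low", "high", "high", "high"]


def nearest_neighbor_predict(x: float, train_x: List[float], train_y: List[str]) -> str:
    """Supervised: return the label of the closest labelled training point."""
    idx = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[idx]


def two_means(xs: List[float], iterations: int = 10) -> Tuple[float, float]:
    """Unsupervised: 1-D k-means with k = 2, using no labels at all.

    Assumes xs contains at least two distinct values, so that neither
    cluster ever becomes empty.
    """
    c0, c1 = min(xs), max(xs)  # simple deterministic initialization
    for _ in range(iterations):
        # Assignment step: each point joins the nearest center.
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        # Update step: each center moves to the mean of its group.
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return c0, c1


predicted = nearest_neighbor_predict(2.4, points, labels)  # uses labels
centers = two_means(points)                                # ignores labels
```

The supervised learner returns one of the given labels, while the clustering routine only discovers that the data falls into two groups; naming those groups is left to a human.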

High level, general comparison table:

|                       | Supervised Learning               | Unsupervised Learning          | Semi-Supervised Learning             | Reinforcement Learning                    |
|-----------------------|-----------------------------------|--------------------------------|--------------------------------------|------------------------------------------|
| Data                  | Labelled                          | Unlabelled                     | Mix of Labelled and Unlabelled       | Depends on State and Reward              |
| Task                  | Prediction                        | Pattern Recognition            | Prediction                           | Sequential Decision Making               |
| Example Algorithms    | Linear Regression, SVM, Neural Networks | Clustering, K-Means, PCA | Self-Training, Multi-View Training   | Q-Learning, SARSA, DQN                    |
| Feedback              | Direct                            | None                           | Partial                              | Reward-based                              |
| Goal                  | Minimize Error on Given Labels    | Discover Hidden Structure      | Better Generalization Accuracy       | Maximize Cumulative Reward                |
| Typical Use Case      | Image Recognition, Email Spam Detection | Customer Segmentation, Anomaly Detection | Web Content Classification, Bioinformatics | Game AI, Robot Navigation, Real-time Decisions |
| Training Efficiency   | High (due to direct feedback)     | Medium (no feedback)           | Varies (depends on labeled/unlabeled ratio) | Typically slow, trial and error-based      |
| Complexity of Problem | Low-Medium                        | High                           | Medium-High                          | High                                      |
| Real-time Adaptation  | Not Typically                     | Not Typically                  | Not Typically                        | Yes, using online learning                 |


# Big Data Systems Concepts

### Cache performance

- cache hit: data requested by processor found in cache
- cache miss: data requested by processor not found in cache, must be retrieved from main memory
- cache hit ratio: fraction of memory accesses found in cache, $h = \frac{\text{hits}}{\text{hits} + \text{misses}}$
- average access time of any memory access: $t_{avg} = h \cdot t_{cache} + (1-h) \cdot t_{memory}$
    - $t_{cache}$ - access time of cache
    - $t_{memory}$ - access time of main memory
- time required to access main memory block = $\text{block\_size} \cdot t_{memory}$
- time required to update cache block = $\text{cache\_block\_size} \cdot t_{cache}$
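
The hit-ratio and average-access-time formulas above can be checked with a short calculation. The function names and the numbers (950 hits, 50 misses, 2 ns cache, 100 ns main memory) are made up for illustration.

```python
def hit_ratio(hits: int, misses: int) -> float:
    """h = hits / (hits + misses)"""
    total = hits + misses
    if total == 0:
        raise ValueError("at least one memory access is required")
    return hits / total


def average_access_time(h: float, t_cache: float, t_memory: float) -> float:
    """t_avg = h * t_cache + (1 - h) * t_memory"""
    return h * t_cache + (1 - h) * t_memory


# Hypothetical workload: 950 hits and 50 misses, times in nanoseconds.
h = hit_ratio(hits=950, misses=50)
t_avg = average_access_time(h, t_cache=2.0, t_memory=100.0)
print(f"h = {h:.2f}, t_avg = {t_avg:.1f} ns")
```

With these assumed numbers, $h = 0.95$ and $t_{avg} = 0.95 \cdot 2 + 0.05 \cdot 100 = 6.9$ ns, so even a 5% miss rate more than triples the average access time relative to the cache alone.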







