# Statistics and Research Methods

## Definition of Instances in Machine Learning
An instance (also called an example, observation, or sample) refers to a single data point in a dataset. Each instance represents an individual record with multiple features (variables) that the model uses for learning and prediction.
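
As a minimal sketch of this definition, the snippet below represents a tiny dataset in plain Python. The field names (`age`, `income`, `defaulted`) and values are hypothetical and only serve to show how one row corresponds to one instance.

```python
# Hypothetical toy dataset: each dict is one instance (row),
# each key is a feature, and "defaulted" is the target label.
dataset = [
    {"age": 34, "income": 52000, "defaulted": 0},
    {"age": 51, "income": 38000, "defaulted": 1},
    {"age": 27, "income": 61000, "defaulted": 0},
]

n_instances = len(dataset)
feature_names = [k for k in dataset[0] if k != "defaulted"]

# Split each instance into its feature vector and its label.
X = [[row[name] for name in feature_names] for row in dataset]
y = [row["defaulted"] for row in dataset]

print(n_instances, feature_names)
print(X[0], y[0])
```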

## Advantages of Replacing Human Decision-Making with Machines
Consistency & Objectivity
- Machines follow predefined rules and algorithms, which reduces the influence of bias, emotion, or personal preference on decisions.

Efficiency & Speed
- Automated decision-making is significantly faster than human processing, especially in large-scale data-driven applications like finance, healthcare, and logistics.

Scalability
- Machines can handle vast amounts of data and make thousands or millions of decisions simultaneously, which is impossible for humans.

Reduction in Human Error 
- Machines don’t suffer from fatigue, stress, or cognitive overload, reducing the likelihood of mistakes in decision-making.

Data-Driven Insights 
- Machine learning and AI can analyze complex patterns and relationships in data that humans might overlook, leading to more informed and evidence-based decisions.

Cost Savings 
- Automating decisions can reduce labor costs and the need for human oversight in routine or repetitive tasks.

24/7 Availability 
- Unlike humans, machines don’t need rest, enabling continuous operations in industries such as customer service, healthcare, and security.

Fairness in Some Contexts 
- When properly designed and trained on representative data, AI systems can reduce certain human biases and discrimination in decision-making, such as in hiring or loan approvals.

Handling Complexity 
- Machines can process multidimensional data and complex scenarios beyond human cognitive limits, leading to better optimization and strategic planning.

Improved Safety 
- In high-risk industries (e.g., autonomous vehicles, industrial automation, medical diagnostics), machine-based decision-making can minimize accidents and improve safety outcomes.

## Understanding Statistical Models vs. Machine Learning Models

It is essential to first understand the distinctions between statistical models and machine learning models, as they differ in purpose, assumptions, and interpretive depth.

- Statistical Models:
    - These are rooted in traditional statistics and focus on relationships between variables through predefined equations.
    - Statistical models aim to understand the underlying data-generating process, focusing on hypothesis testing and inference.
    - These models often rely on strong assumptions such as:
        - linearity,
        - normality, and
        - homoscedasticity.
    - They are **interpretable**, making it easier to understand the impact of individual variables.

- Machine Learning Models: 
    - These prioritize **predictive** power over interpretability. 
    - They are designed to automatically learn patterns and relationships within data, often with minimal assumptions. 
    - Machine learning models can handle complex and high-dimensional data but may lack transparency about how individual features affect the outcome, especially in “black box” models like neural networks or ensemble methods.

| Aspect | Machine Learning   | Statistics            |
|---------------|--------------------|----------------------------------|  
|Goal	|Prediction and pattern recognition, often focusing on optimizing performance on unseen data. |	Understanding relationships between variables, inference, and hypothesis testing.|
|Approach	| Data-driven; models learn patterns from data with few predefined assumptions. | Model-driven; relies on predefined mathematical models and assumptions about data. |
|Focus	|Generalization to new data (predictive power). |	Explanation and inference (understanding relationships).|
|Assumptions	|Often makes fewer explicit assumptions about data distributions (e.g., neural networks, decision trees). |	Relies on assumptions like normality, independence, and linearity (e.g., regression models).|
|Data Needs	|Often requires large datasets to perform well. |	Can work well with smaller datasets when assumptions hold.|
|Model Complexity |	Can involve highly complex models (e.g., deep learning, random forests) with many parameters. |	Often relies on simpler, interpretable models like linear regression or logistic regression.|
|Interpretability	|Many ML models (e.g., neural networks, ensemble models) are black boxes with limited interpretability. |	Statistical models are generally interpretable, with well-defined coefficients and p-values.|
|Use of Probability	|Probability is often implicit (e.g., confidence scores in classification). |	Probability theory is fundamental, with formal probability distributions and significance testing.|
|Evaluation Metrics	| 	Focuses on prediction accuracy (e.g., RMSE, F1-score, log-loss, AUC-ROC). |	Focuses on statistical significance (e.g., p-values, confidence intervals, hypothesis tests).|
|Handling of Bias & Variance	| Uses techniques like cross-validation, regularization, and ensembling to balance bias-variance trade-off. |	Uses theoretical methods like maximum likelihood estimation and Bayesian inference.|
|Common Applications| Image recognition, fraud detection, recommendation systems, NLP, autonomous systems. |	Medical studies, economics, market research, A/B testing, survey analysis.|
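
The contrast in the table above can be made concrete with one small model read in two ways. The sketch below fits a simple linear regression in closed form, then interprets the slope (the statistical view) and measures error on held-out points (the machine learning view). The data and the helper name `fit_ols` are made up for illustration.

```python
import math

# Hypothetical data: advertising spend (x) and sales (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

def fit_ols(xs, ys):
    """Closed-form simple linear regression: y = intercept + slope * x."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

# Fit on the first four points and hold out the last two.
intercept, slope = fit_ols(x[:4], y[:4])

# Statistical view: interpret the coefficient.
print(f"Each extra unit of x is associated with {slope:.2f} more y")

# Machine learning view: measure error on unseen data.
preds = [intercept + slope * a for a in x[4:]]
rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, y[4:])) / len(preds))
print(f"Held-out RMSE: {rmse:.3f}")
```

The same fitted line serves both goals; what changes is the question asked of it.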

## Data Mining
Data mining is the process of discovering patterns, correlations, trends, or useful information from large sets of data using statistical, machine learning, and computational techniques. It transforms raw data into meaningful insights for decision-making.

###  Core Steps in Data Mining
- Data Cleaning – Removing noise or irrelevant data.
- Data Integration – Combining data from multiple sources.
- Data Selection – Choosing relevant data for analysis.
- Data Transformation – Converting data into appropriate formats.
- Data Mining – Applying algorithms to extract patterns.
- Pattern Evaluation – Identifying truly interesting patterns.
- Knowledge Representation – Visualizing the mined data.

Applications in Different Industries
1. Retail & E-commerce
    - Market Basket Analysis: Identifies products frequently bought together (e.g., Amazon's "Customers who bought this also bought…").
    - Customer Segmentation: Target marketing strategies based on purchasing behavior.
    - Inventory Management: Forecasting demand to optimize stock levels.

2. Banking & Finance
    - Fraud Detection: Uncover suspicious transactions using anomaly detection.
    - Credit Scoring: Predicting creditworthiness of customers.
    - Algorithmic Trading: Analyzing market patterns for automated trading.

Finance
- Fraud Detection
    - How: Use classification algorithms (e.g., Decision Trees, Neural Networks) to detect unusual transaction patterns.
    - Example: A credit card company flags a transaction in another country right after a local purchase, indicating possible fraud.
- Credit Risk Assessment
    - How: Apply logistic regression or machine learning models to historical customer data (income, debt, payment history).
    - Example: A bank evaluates a loan applicant's likelihood of default based on mined patterns from past defaulters.

Marketing
- Market Basket Analysis
    - How: Use association rule mining (e.g., Apriori algorithm) to find product combinations frequently purchased together.
    - Example: A grocery store finds that people who buy diapers often buy baby wipes and adjusts shelf placement and promotions accordingly.
- Campaign Response Modeling
    - How: Use classification or uplift models to predict which customers are likely to respond to marketing campaigns.
    - Example: An online retailer sends discount codes only to customers who are predicted to convert, improving ROI.
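
The association rule idea behind market basket analysis can be sketched with three standard metrics: support, confidence, and lift. The snippet below is not the Apriori algorithm itself; it counts item pairs by brute force on a handful of hypothetical baskets, which is only practical for tiny datasets.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions; each set is one shopping basket.
baskets = [
    {"diapers", "baby wipes", "milk"},
    {"diapers", "baby wipes"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"diapers", "baby wipes", "bread"},
]

item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(
    pair for b in baskets for pair in combinations(sorted(b), 2)
)

def rule_metrics(antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(baskets)
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    support = both / n                           # share of baskets with both items
    confidence = both / item_counts[antecedent]  # P(consequent | antecedent)
    lift = confidence / (item_counts[consequent] / n)
    return support, confidence, lift

support, confidence, lift = rule_metrics("diapers", "baby wipes")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items appear together more often than they would if purchased independently.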

Customer Relationship Management (CRM)
- Customer Segmentation
    - How: Use clustering techniques like K-means to group customers by behavior, preferences, or demographics.
    - Example: A telecom company segments users into heavy data users vs. voice-call users to tailor service plans.
- Churn Prediction
    - How: Use predictive models to identify customers likely to leave based on usage patterns, complaints, or support interactions.
    - Example: A streaming service targets users showing signs of disengagement with personalized recommendations to retain them.
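
To illustrate the K-means segmentation described above, here is a minimal from-scratch sketch on hypothetical telecom usage data. The starting centroids are chosen by hand so the result is deterministic; production code would normally scale the features and use a library implementation with randomized initialization.

```python
import math

# Hypothetical usage data: (monthly data in GB, monthly call minutes).
customers = [(42, 60), (38, 45), (45, 80), (3, 520), (5, 610), (2, 480)]

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then update."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster ends up empty).
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Hand-picked starting centroids (an assumption made for reproducibility).
centroids, clusters = kmeans(customers, centroids=[(40, 50), (5, 500)])
print("Heavy data users:", clusters[0])
print("Voice-call users:", clusters[1])
```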

3. Healthcare
    - Disease Prediction & Diagnosis: Identifying potential diseases based on symptoms and historical data.
    - Patient Profiling: Personalizing treatment plans using patient history.
    - Drug Discovery: Analyzing clinical trials and genomic data to discover new drugs.

4. Telecommunications
    - Churn Prediction: Identifying customers likely to leave and creating retention strategies.
    - Network Optimization: Improving service quality by analyzing usage patterns.

5. Manufacturing
    - Predictive Maintenance: Preventing equipment failure by analyzing sensor data.
    - Quality Control: Detecting anomalies in product quality during production.

6. Education
    - Student Performance Prediction: Identifying at-risk students for early intervention.
    - Curriculum Personalization: Recommending personalized learning paths.

7. Government & Public Sector
    - Crime Analysis: Detecting crime hotspots and patterns.
    - Tax Fraud Detection: Spotting inconsistencies in tax data.
    - Policy Making: Analyzing census and survey data for informed decisions.

## Data Science
Data Science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

It combines elements of:
- Statistics & Mathematics – for analysis and inference
- Computer Science – for data processing and algorithm development
- Domain Knowledge – for interpreting results in context
- Machine Learning & AI – for building predictive and intelligent systems
- Data Engineering – for managing and transforming data at scale
- Visualization – for communicating insights effectively

🔄 Core Workflow of Data Science
- Problem Definition – Understanding the business question.
- Data Collection – Gathering data from various sources (databases, APIs, web scraping, sensors, etc.).
- Data Cleaning & Preprocessing – Handling missing data, duplicates, and inconsistent formatting.
- Exploratory Data Analysis (EDA) – Summarizing main characteristics using plots and statistics.
- Feature Engineering – Creating or selecting relevant variables for modeling.
- Model Building – Using statistical models or machine learning algorithms.
- Model Evaluation – Checking performance using metrics like accuracy, RMSE, AUC, etc.
- Deployment – Making models available in production (e.g., web apps, APIs).
- Monitoring & Maintenance – Continuously tracking model performance.

📌 Examples of Data Science in Action
- Healthcare: Predicting disease outbreaks or diagnosing illnesses from medical images.
- Finance: Credit scoring, fraud detection, algorithmic trading.
- Retail: Personalized recommendations, inventory optimization.
- Marketing: Customer segmentation, campaign effectiveness.
- Public Health: Modeling the spread of diseases, evaluating the impact of health interventions.

## Key Differences Between Data Science and Data Mining
|Feature	|Data Science	|Data Mining|
|-----------|---------------|-----------|
|Definition	|A broad interdisciplinary field focused on extracting insights and value from data using scientific methods.|	A specific process of discovering patterns and relationships in large datasets.|
|Scope	|Broad: includes data collection, cleaning, modeling, interpretation, and communication.	|Narrower: mainly focused on pattern discovery and knowledge extraction.|
|Goal	|To solve complex business problems and build predictive or decision-making systems.	|To uncover hidden patterns and associations in data.|
|Techniques Used|	Machine learning, deep learning, statistical modeling, big data processing, AI.|	Clustering, classification, association rule mining, anomaly detection.|
|Tools|	Python, R, SQL, Hadoop, Spark, TensorFlow, Tableau.|	Weka, RapidMiner, Orange, SQL, R, Python (with specific libraries).|
|Output|	Models, dashboards, predictive systems, business insights.|	Patterns, rules, summaries of relationships in data.|
|Data Types|	Structured, semi-structured, unstructured (text, image, audio).	|Mostly structured data.|
|Domain Knowledge Importance|	Very high – needed to guide the entire problem-solving process.|	Moderate – useful for interpreting discovered patterns.|
|Interdisciplinary?|	Yes – blends statistics, computer science, business knowledge.|	Not as interdisciplinary – rooted more in statistics and computer science.|

🧠 Analogy
- Data Science is like being a full chef – choosing the ingredients (data), preparing them, cooking the dish (model), and presenting it.
- Data Mining is like discovering hidden recipes in a big cookbook – finding which ingredients often go together and what dishes they create.

🧪 How They Work Together
- Data Mining is a step within the Data Science workflow.
- In Data Science, once data is prepared, data mining may be used to explore patterns before moving to more advanced modeling.

Step-by-Step Breakdown (Example: Transaction Fraud Detection)

✅ 1. Data Collection (Data Science)
- Sources: Transaction logs, user profiles, device information, IP geolocation, past fraud history.
- Tools: Python, SQL, Kafka, APIs
- Purpose: Gather raw structured (transaction) and unstructured (browser behavior) data.

🧼 2. Data Preprocessing (Data Science)
- Cleaning missing values, formatting timestamps, normalizing amounts, flagging duplicate entries.
- Encoding categorical data like merchant type or location.
- Feature engineering: Time between transactions, transaction velocity, location changes.

🔍 3. Data Mining (Pattern Discovery Phase)

Here’s where Data Mining techniques are applied:

a. Anomaly Detection
- Identify outliers: unusually large transactions or transactions from unfamiliar locations.
- Technique: Clustering (e.g., DBSCAN), distance-based anomaly detection.

b. Association Rule Mining
- Purpose: Find combinations of purchases that typically occur together in legitimate behavior.
- Example: If a customer usually buys groceries and fuel, a sudden electronics purchase overseas could be suspicious.

c. Frequent Pattern Mining
- Find commonly occurring fraud patterns from past data.
- Useful for building rules or heuristics.
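
The anomaly detection step (3a) can be illustrated with a much simpler technique than DBSCAN: a robust outlier score that measures how far each value lies from the median, scaled by the median absolute deviation (MAD). The amounts, the 3.5 cutoff, and the function name `flag_outliers` are assumptions chosen for this sketch.

```python
from statistics import median

# Hypothetical transaction amounts for one customer.
amounts = [20.0, 35.0, 18.0, 42.0, 25.0, 30.0, 950.0]

def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the MAD) exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread, so nothing can be scored
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

print(flag_outliers(amounts))
```

Using the median and MAD rather than the mean and standard deviation keeps a single extreme value from masking itself by inflating the spread.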

🤖 4. Predictive Modeling (Data Science)
- Train a classification model (e.g., Logistic Regression, Random Forest, XGBoost, or Neural Networks) to predict:
    - 1 for fraud
    - 0 for legitimate
- Input Features:
    - Transaction amount, time of day, location, device ID, account age, historical fraud rate.
- Output: Probability of fraud.

📊 5. Model Evaluation & Deployment (Data Science)
- Metrics: Precision, recall, F1-score, ROC-AUC (accuracy alone is misleading because fraud is rare and the classes are imbalanced).
- Action: If the fraud score exceeds a threshold:
    - Flag for manual review
    - Automatically decline or challenge the transaction (e.g., OTP)
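
The evaluation metrics in step 5 can be computed directly from predictions and labels. The sketch below uses made-up scores and a threshold of 0.5; both are assumptions for illustration, and in practice the threshold is tuned to balance missed fraud against false alarms.

```python
# Hypothetical model scores and true labels (1 = fraud, 0 = legitimate).
scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.80, 0.90, 0.15, 0.70, 0.02]
labels = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]

def evaluate(scores, labels, threshold):
    """Return precision, recall, and F1 for a given decision threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

precision, recall, f1 = evaluate(scores, labels, threshold=0.5)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```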

## Data-Driven Decision Making (DDDM) 
Data-driven decision making (DDDM) is the process of making decisions based on data analysis and insights rather than intuition, guesswork, or personal experience alone.

It involves collecting relevant data, analyzing it using statistical or machine learning tools, and using the results to guide strategic or operational choices. This approach helps organizations become more objective, consistent, and evidence-based in their actions.

🔄 Key Elements of Data-Driven Decision Making
- Collect: Gather high-quality, relevant data.
- Analyze: Use analytics, statistical methods, or AI models to uncover patterns and trends.
- Interpret: Understand what the data means in a business context.
- Decide: Take action based on the evidence.
- Measure: Monitor outcomes to refine future decisions.


📌 Two Types of Decisions That Benefit from DDDM
1. Strategic Decisions
    - These are long-term, high-impact decisions that shape the direction of an organization.

✅ Examples:
- Market Expansion: Deciding which new region or country to enter based on customer demographics, purchasing behavior, and economic indicators.
- Product Development: Launching a new feature or product based on customer feedback, usage patterns, and competitor analysis.
- Resource Allocation: Investing in technology, R&D, or marketing channels based on ROI predictions and cost-benefit analysis.

💡 Benefit of DDDM:
- Reduces risk of costly missteps.
- Aligns decision-making with real market needs and performance indicators.

2. Operational Decisions
    - These are day-to-day, tactical decisions that affect the efficiency and performance of processes.

✅ Examples:
- Customer Support: Routing customer service tickets based on issue type, past resolution time, or sentiment analysis.
- Inventory Management: Reordering stock based on demand forecasts and sales trends.
- Fraud Detection: Flagging suspicious transactions in real time using predictive analytics.

💡 Benefit of DDDM:
- Improves speed and accuracy of actions.
- Enhances customer experience and operational efficiency.

🎯 Summary:
|Decision Type|	Description|	Example|	Data-Driven Advantage|
|----------------|------------|-----------|-----------------------|
|Strategic|	Long-term, directional|	Choosing markets, product launches|	Informed planning and reduced uncertainty|
|Operational|	Daily/tactical|	Inventory, customer support, fraud checks|	Efficiency, automation, better performance|

### Case Study: PayPal – Combating Fraud and Expanding Services with Data
🏦 Company Overview
- PayPal is a leading online payments platform serving millions of customers globally. With millions of transactions per day, real-time risk management and service personalization are essential to their success.

🎯 1. Strategic Decision Example: Expanding into New Financial Services

✅ The Challenge:
- PayPal wanted to expand beyond peer-to-peer payments into offering credit, investment, and savings products. But which services should they prioritize?

🔍 Data Used:
- Transaction patterns of existing users
- User demographics and income bands
- Purchase categories (e.g., travel, subscriptions, bills)
- Creditworthiness signals
- Market trends in underserved regions

📊 Data-Driven Decision:
- Using customer segmentation and predictive analytics, PayPal identified:
    - Users who regularly paid bills or sent remittances as potential candidates for micro-lending.
    - Younger, digitally savvy users for investment products.
    - Regions where competitors lacked strong financial infrastructure.

💡 Outcome:
- Rolled out PayPal Credit and PayPal Savings successfully.
- Increased customer retention and cross-product adoption.
- Strategic expansion aligned with actual customer behavior, not assumptions.

⚙️ 2. Operational Decision Example: Real-Time Fraud Detection
✅ The Challenge:
- With thousands of global transactions per second, PayPal needed to detect and prevent fraud in real-time without blocking legitimate users.

🔍 Data Used:
- Device fingerprinting
- IP address and geolocation
- Time of transaction
- Velocity of transactions
- Historical fraud labels
- Behavioral signals (mouse movement, app usage)

🧠 Data-Driven Solution:
- PayPal built a real-time fraud detection engine using:
    - Gradient Boosting Machines (GBMs) and Deep Learning models trained on historical fraud patterns.
    - Ensemble models combining rule-based alerts and machine learning scores.
    - Continuous feedback loops for model retraining.

💡 Outcome:
- Real-time detection of fraud with false positive rates under 1%.
- Prevented millions in fraud losses annually.
- Enabled smooth user experience by reducing false blocks.

### Automated Decision System: Fraud Scoring in Insurance Claims
🎯 Goal:
To automatically identify and flag potentially fraudulent insurance claims in real-time or near real-time, minimizing losses and improving investigation efficiency.

⚙️ How the System Works (Step-by-Step)
1. 📥 Input Data Collection

The system collects structured and unstructured data from:
- Claim forms (amount, type of claim, time of filing)
- Customer profile (age, past claims, location, occupation)
- Policy details (coverage type, duration, premium)
- External data sources (e.g., repair bills, police reports)
- Historical fraud cases and red flags
- Social network or geospatial data (in advanced setups)

2. 🧠 Feature Engineering

Key features are derived, including:
- Time delay between incident and filing
- Similarity to past fraudulent claims
- Number of claims in short periods
- Inconsistencies between claim details and supporting docs
- Unusual claim amounts for the policy type
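
A few of these derived features can be sketched with the standard library. The record layout (`incident`, `filed`, `amount`), the 90-day window, and the function name `engineer_features` are hypothetical choices for this example, not a prescribed schema.

```python
from datetime import date

# Hypothetical claim records for a single policyholder.
claims = [
    {"incident": date(2024, 1, 3), "filed": date(2024, 1, 5), "amount": 1200},
    {"incident": date(2024, 2, 1), "filed": date(2024, 3, 15), "amount": 9800},
    {"incident": date(2024, 2, 20), "filed": date(2024, 2, 22), "amount": 700},
]

def engineer_features(claim, history, window_days=90):
    """Derive simple fraud-related features for one claim."""
    # Other claims filed within the window before this claim was filed.
    recent = [
        c for c in history
        if c is not claim
        and 0 <= (claim["filed"] - c["filed"]).days <= window_days
    ]
    return {
        "filing_delay_days": (claim["filed"] - claim["incident"]).days,
        "claims_in_window": len(recent),
        "amount": claim["amount"],
    }

features = engineer_features(claims[1], claims)
print(features)
```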

3. 🤖 Machine Learning Model
A fraud scoring model is trained using labeled historical data (fraudulent vs. genuine claims).

Common algorithms used:
- Logistic Regression – interpretable and fast
- Random Forest / XGBoost – handles complex nonlinearities and interactions
- Neural Networks – used when unstructured data (text, images) are involved

The model outputs a fraud probability score between 0 and 1.

4. 📊 Decision Thresholds & Automation Rules

Based on the fraud score, automated decision logic is applied:
|Fraud Score|	Action|
|-----------|----------|
|0.00 – 0.30|	Auto-approve claim|
|0.31 – 0.70|	Flag for manual review|
|0.71 – 1.00|	Auto-reject or escalate to SIU 🔍|

Rules may also include:
- Blocking payment triggers
- Sending alerts to investigators
- Logging metadata for audits
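
The score bands in the table above translate into a small routing function. This is a sketch: the action labels are hypothetical, and scores that fall between the listed bands (for example 0.305) are assumed to belong to the higher band.

```python
def route_claim(fraud_score):
    """Map a fraud probability to an action using the score bands above."""
    if not 0.0 <= fraud_score <= 1.0:
        raise ValueError("fraud_score must be between 0 and 1")
    if fraud_score <= 0.30:
        return "auto_approve"
    if fraud_score <= 0.70:
        return "manual_review"
    return "escalate_to_siu"

for score in (0.12, 0.55, 0.93):
    print(score, route_claim(score))
```

Keeping the thresholds in one place makes them easy to audit and adjust as investigators report back on false positives.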

5. 🔁 Feedback Loop & Model Retraining
- Investigators label outcomes (fraud/not fraud)
- These labels are used to retrain the model periodically
- System gets smarter over time (adaptive fraud detection)

### Data Science for Decision-Makers – Cheat Sheet
🧩 1. What Is Data Science?
- The art and science of turning data into decisions using statistics, machine learning, and domain expertise.

🔑 2. Key Concepts (Simple Definitions)

|Concept|	What It Means|	Why It Matters for You|
|---------|---------------|------------------------|
|Data|	Facts collected (e.g., sales, clicks, transactions)|	Fuel for all insights|
|Descriptive Analytics|	What happened? (e.g., "Sales dropped 10% last week")|	First step in understanding trends|
|Predictive Analytics|	What will happen? (e.g., "This customer may churn")|	Plan ahead; reduce risk|
|Prescriptive Analytics|	What should we do? (e.g., "Offer a 10% discount now")|	Turn insight into action|
|Machine Learning|	Algorithms that learn from data to make predictions|	Automates and improves decisions over time|
|Model|	A tool that finds patterns and makes predictions|	The engine behind data-driven decisions|
|Feature|	A variable or column (e.g., age, income, product views)|	Used to train models and find insights|
|Bias|	Skewed results due to bad data or wrong assumptions|	Can lead to unfair or wrong outcomes|

🧠 3. Must-Know Basics (for Interpreting Insights)

|Principle|	Why It Matters for You|
|---------|--------------------------|
|Correlation ≠ Causation|	Just because two things move together doesn’t mean one causes the other|
|Sample Size Matters|	Small or skewed data can lead to misleading results|
|Ask “Compared to What?”|	Always check baselines and control groups|
|Data Can Be Dirty|	Missing, inconsistent, or incorrect data affects results|
|All Models Have Assumptions|	No model is perfect—always ask “what was assumed?”|

🔍 4. Questions to Ask Your Data Team

1. What business question are we answering?
2. Where did the data come from?
3. How was success measured?
4. What assumptions does the model make?
5. How confident are we in this prediction?


When you're talking to a senior stakeholder like the CFO, your focus should be on understanding the business context, desired outcomes, and constraints.

## What is Big Data?
Big Data refers to datasets so large or complex that traditional data processing software cannot handle them efficiently.

Big Data is commonly characterized by the 5 V’s:

|V|	Meaning|	Example|
|-----|-------|-------|
|Volume|	Massive amounts of data|	Billions of credit card transactions|
|Velocity|	Data generated in real-time or near-real-time|	Social media updates, sensor data|
|Variety|	Different data types (structured, semi-structured, unstructured)|	Images, text, video, logs|
|Veracity|	Uncertainty or quality of data|	Inconsistent or noisy data from user inputs|
|Value|	Turning raw data into actionable insights|	Predicting churn, detecting fraud|


### Why Appropriate Data and Data Scientists Are Required to Extract Useful Knowledge

1. 📊 Why Appropriate Data is Needed
    - Having the right data is crucial because:
        - Garbage in = garbage out: Poor quality, irrelevant, or biased data leads to incorrect conclusions.
        - Context matters: You need domain-relevant features to solve specific problems (e.g., you can’t predict loan defaults without credit history).
        - Completeness and Accuracy: Decisions rely on clean, complete, and well-structured data.
    - Example: Predicting insurance fraud using only age and gender is likely to fail. You need detailed claim behavior, policy details, and historical fraud patterns.

2. 🧠 Why Data Scientists Are Essential
    - Data scientists bring a blend of:
        - Statistics & Mathematics (to analyze trends, test hypotheses)
        - Programming (to work with large-scale data using tools like Python, R, SQL)
        - Machine Learning (to create predictive models)
        - Domain Knowledge (to ask the right questions and interpret results)
    - They ensure:
        - The right questions are asked
        - The right techniques are applied (e.g., regression vs. classification)
        - The insights are valid, explainable, and actionable
    - Without skilled data scientists, businesses might:
        - Misinterpret correlations as causation
        - Build biased or misleading models
        - Miss key patterns hiding in the data

### Why It's Necessary to Understand Data Science – Even If You Won’t Use It Directly
Even if you're not building models yourself, understanding data science is critical in a modern, data-driven world, especially for decision-makers.

1. 📣 Better Communication with Data Teams
    - You can ask the right questions
    - Understand reports, dashboards, and model limitations
    - Collaborate more effectively with technical teams

2. 🎯 Make Informed Decisions
    - Know what the data is really saying
    - Spot when results don’t make sense or when assumptions are flawed
    - Ask for more analysis where needed (e.g., “Can we break this down by customer segments?”)

3. 🛡️ Avoid Being Misled
    - Understand basics like correlation ≠ causation, sampling bias, or p-values
    - Protect yourself from overhyped or flawed analytics

4. 💼 Gain a Competitive Edge
    - Managers and leaders who understand data science are better at:
        - Allocating resources
        - Prioritizing projects
        - Identifying growth opportunities

## Four Fundamental Concepts of Data Science

1. 📊 Data Wrangling (or Data Preparation)
Definition: The process of cleaning, transforming, and organizing raw data into a usable format for analysis or modeling.

Why it matters:
- Real-world data is messy — it often has missing values, errors, duplicates, or inconsistencies. Effective data wrangling ensures that your dataset is accurate, complete, and structured properly, forming the foundation for trustworthy analysis.

Key tasks:
- Handling missing or inconsistent data
- Merging or reshaping datasets
- Feature engineering (creating meaningful variables)
- Data type conversions

2. 📈 Statistical Inference
Definition: Drawing conclusions about a population based on a sample using statistical methods (e.g., confidence intervals, hypothesis testing).

Why it matters:
- It allows you to make data-driven decisions even when you can't measure everything — using probabilities to estimate relationships, detect patterns, and test assumptions.

Key tools:
- Hypothesis testing (t-tests, chi-square tests)
- Regression analysis
- Probability distributions
- Confidence intervals and p-values
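
As a small worked example of inference from a sample, the sketch below computes a confidence interval for a proportion using the normal approximation (the Wald interval). The survey numbers are hypothetical, and the approximation is only reasonable for fairly large samples with proportions not too close to 0 or 1.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# Hypothetical sample: 120 of 1,000 surveyed customers churned.
low, high = proportion_ci(120, 1000)
print(f"Estimated churn rate: 12.0% (95% CI {low:.1%} to {high:.1%})")
```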

3. 🤖 Machine Learning
Definition: Algorithms that allow systems to learn patterns from data and make predictions or decisions without being explicitly programmed for every scenario.

Why it matters:
- It enables automation of decision-making, such as fraud detection, customer segmentation, churn prediction, or recommendation engines.

Key categories:
- Supervised learning: Prediction based on labeled data (e.g., classification, regression)
- Unsupervised learning: Finding hidden patterns in unlabeled data (e.g., clustering)
- Reinforcement learning: Learning from feedback (used in dynamic systems like trading bots)

4. 📊 Data Visualization & Communication
Definition: The graphical representation of data and insights using charts, graphs, dashboards, or storytelling techniques.

Why it matters:
- The value of analysis is only realized if stakeholders can understand and act on the insights. Visualization bridges the gap between technical analysis and business decision-making.

Key techniques:
- Exploratory data analysis (EDA)
- Charts (bar, line, scatter, boxplots)
- Dashboards (e.g., Tableau, Power BI)
- Storytelling with data (context, clarity, visuals)

## Selecting the right analytical approach and data mining algorithm 
The right choice depends on the business problem, data characteristics, and desired outcome.

#### Key Questions to Determine the Right Analysis Approach
A. Understanding the Business Problem

Step 1: What is the business objective?

Ask:
- Are we trying to describe, predict, explain, or optimize something?
- Is the focus on understanding the past, making decisions now, or forecasting the future?

**What is the business objective?**
- Prediction? → Predictive Analytics
- Understanding why something happened? → Diagnostic / Causal Analysis
- Finding patterns in behavior? → Behavioral Segmentation / Clustering
- Optimizing decisions? → Prescriptive Analytics
- Forecasting future trends? → Time Series Decomposition / Forecasting

**Do we need to compare groups or test a hypothesis?**
- A/B Testing (if randomized controlled experiment)
- Hypothesis Testing (if comparing means/proportions)
- Inferential Statistics (if generalizing from a sample)

**Are we trying to measure relationships?**
- Correlation (if measuring association)
- Causal Inference (if determining cause-effect, e.g., propensity score matching, DAGs, RCTs)

**Is the data sequential/time-dependent?**
- Time Series Analysis (ARIMA, Prophet, LSTM)
- Survival Analysis (for churn/risk modeling)
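
As a lightweight stand-in for the heavier time series methods named above, simple exponential smoothing shows the core idea of weighting recent observations more heavily. The sales figures and the smoothing factor `alpha=0.5` are assumptions for illustration.

```python
def exponential_smoothing(series, alpha=0.5):
    """Simple exponential smoothing; returns the one-step-ahead forecast."""
    level = series[0]
    for value in series[1:]:
        # Blend the newest observation with the running level.
        level = alpha * value + (1 - alpha) * level
    return level

# Hypothetical monthly sales figures.
monthly_sales = [100, 110, 105, 120, 125]
forecast = exponential_smoothing(monthly_sales)
print(f"Next-month forecast: {forecast}")
```

This method ignores trend and seasonality, which is where models such as ARIMA become useful.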

**Are we dealing with uncertainty or risk?**
- Probabilistic Models (Bayesian Networks, Monte Carlo Simulation)
- Probability Distributions (Poisson for counts, Normal for continuous)
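
A Monte Carlo simulation estimates risk by repeatedly sampling from assumed distributions. In the sketch below, every number (100 policies, a 5% claim probability, exponential claim sizes with mean 2,000) is a hypothetical assumption, and a fixed seed keeps the run reproducible.

```python
import random

def simulate_annual_loss(n_trials=5_000, seed=42):
    """Monte Carlo estimate of yearly claim losses under assumed distributions."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_trials):
        # Assumption: each of 100 policies claims with probability 5%.
        n_claims = sum(rng.random() < 0.05 for _ in range(100))
        # Assumption: claim sizes are exponential with mean 2,000.
        totals.append(sum(rng.expovariate(1 / 2000) for _ in range(n_claims)))
    totals.sort()
    mean_loss = sum(totals) / n_trials
    p95_loss = totals[int(0.95 * n_trials)]  # 95th percentile of simulated losses
    return mean_loss, p95_loss

mean_loss, p95_loss = simulate_annual_loss()
print(f"Expected loss: {mean_loss:,.0f}; 95th percentile: {p95_loss:,.0f}")
```

The gap between the mean and the 95th percentile gives a sense of how much worse a bad year could be than an average one.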

**Do we need to optimize marketing spend?**
- Marketing Mix Modeling (MMM) (regression-based attribution)
- Multi-Touch Attribution (MTA) (Shapley value, Markov chains)


|Business Objective|	Analysis Type|
|--------------------|----------------|
|What happened?|	Descriptive Analytics|
|Why did it happen?|	Diagnostic / Causal|
|What is likely to happen next?|	Predictive Analytics|
|What should we do about it?|	Prescriptive Analytics|


🔍 Step 2: What is the nature of the data and outcome?

Ask:
- Do we have labeled data (target variable)?
- Is the outcome categorical, continuous, binary, or unknown?
- Are we measuring change over time?
- Are we dealing with networks, groups, or events?

This determines whether we use:
- Statistical modeling (e.g., regression, hypothesis testing)
- Machine learning (e.g., classification, clustering)
- Time series analysis, or
- Behavioral/structural approaches (e.g., graph theory, segmentation)

### What Determines the Use of Statistical Modeling?
Ask these core questions:

✅ A. Do I need to understand relationships, not just predict outcomes?
- Statistical modeling helps you interpret how and why variables relate (e.g., how age impacts claim risk).

✅ B. Is interpretability more important than raw predictive power?
- If stakeholders (regulators, doctors, underwriters) need transparent insights, go for statistical models.

✅ C. Do I have a hypothesis or theory I want to test?
- If you're trying to validate an assumption (e.g., "Smoking increases risk of stroke"), statistical inference is key.

✅ D. Is my data sample small or medium-sized?
- Statistical models are efficient with limited data and can quantify uncertainty through confidence intervals, p-values, etc.

##### How to Break Down and Solve a Statistical Modeling Problem
🧠 Step-by-Step Framework

|Step|	What to Do|	Example (Loan Default)|
|-----------|-----------------|--------------|
|1. Define|	What’s the question/hypothesis?|	Does higher DTI increase risk of default?|
|2. Explore|	Understand data structure, missing values|	Check distributions, missing income data|
|3. Choose model|	Based on outcome type & interpretability|	Logistic Regression (binary outcome)|
|4. Fit model|	Train on labeled data|	Use statsmodels or sklearn|
|5. Validate assumptions| Linearity, independence, etc.| Check VIFs for multicollinearity; use the ROC curve to assess discrimination|
|6. Interpret results|	Focus on effect sizes, CIs, p-values|	Odds of default ↑ by 1.5x with DTI > 40%|
|7. Communicate|	Use visuals + plain language|	“Customers with low credit score & high DTI are 2x riskier”|

Use statistical modeling when:
- You want interpretable insights
- You aim to test hypotheses
- You want to explain relationships, not just predict
- You need inference, not just accuracy

##### 🏦 Financial Services

Use Case 1: Predict the likelihood of loan default based on income, credit score, and debt-to-income (DTI) ratio.

Model: Logistic Regression (binary outcome)

Question: What factors significantly impact default risk?

Solve:
- Run logistic regression
- Interpret coefficients (e.g., odds of default ↑ as DTI ↑)
- Use model to flag risky borrowers
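
As a rough illustration of the "interpret coefficients" step: for a single yes/no predictor (say, a hypothetical "DTI above 40%" flag), the exponentiated logistic regression coefficient equals the odds ratio of the corresponding 2×2 table. The sketch below computes that odds ratio with a Wald confidence interval in plain Python. The counts and the `odds_ratio` helper are invented for illustration; a real analysis with several predictors would fit the model with a library such as statsmodels.

```python
import math

def odds_ratio(exposed_events, exposed_non_events,
               unexposed_events, unexposed_non_events, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table."""
    ratio = (exposed_events * unexposed_non_events) / (
        exposed_non_events * unexposed_events)
    # Standard error of log(odds ratio)
    se_log = math.sqrt(1 / exposed_events + 1 / exposed_non_events
                       + 1 / unexposed_events + 1 / unexposed_non_events)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper

# Hypothetical counts: 30 defaults / 70 non-defaults among high-DTI borrowers,
# 20 defaults / 80 non-defaults among low-DTI borrowers.
ratio, lower, upper = odds_ratio(30, 70, 20, 80)
print(f"odds ratio = {ratio:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

With these made-up counts the interval spans 1, so the data would be compatible with no association despite the point estimate being above 1.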

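Use Case 2: Test whether a new digital onboarding process improves customer conversion.

Model: A/B test analyzed with a two-proportion z-test (conversion is a yes/no outcome; with large samples a two-sample t-test gives nearly the same answer)

Question: Is the difference in conversion rate statistically significant?

Solve:
- Randomly assign users to control and test groups
- Run the test on the two conversion rates
- Compare the p-value with the chosen significance level (e.g., 0.05) to reject or fail to reject the null hypothesis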
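
The steps above can be sketched in a few lines. Because conversion is a binary outcome, this sketch uses a two-proportion z-test, which is one common choice for comparing conversion rates. The function name and the conversion counts are assumptions made up for the example.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis of no difference
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: control converts 100/1000, new onboarding 130/1000
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A p-value below the chosen significance level leads to rejecting the null hypothesis of equal conversion rates; otherwise we fail to reject it.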

##### 🏥 Health & Medical

Use Case 1: Identify risk factors for readmission within 30 days.

Model: Multiple Logistic Regression

Question: Which comorbidities or treatment delays are associated with higher readmission?

Solve:
- Collect EHR data
- Fit model with features like age, diagnosis, length of stay
- Report odds ratios and confidence intervals

Use Case 2: Estimate average recovery time for different surgical procedures.

Model: ANOVA (Analysis of Variance)

Question: Are there significant differences in recovery time across procedures?

Solve:
- Run one-way ANOVA
- If significant, apply post-hoc tests (Tukey) to compare pairs
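
To make the ANOVA step concrete, here is a minimal sketch of the F statistic (between-group variance divided by within-group variance) in plain Python. The recovery times are invented, and the sketch stops at the F statistic: turning it into a p-value needs the F distribution, which a library routine such as `scipy.stats.f_oneway` provides.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic and its degrees of freedom for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical recovery times (days) for three procedures
recovery_days = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
f_stat, df_between, df_within = one_way_anova_f(recovery_days)
print(f_stat, df_between, df_within)
```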

##### 🛡 Insurance
Use Case 1: Estimate claim severity based on policyholder and accident details.

Model: Linear Regression

Question: How does claim amount vary with age, location, vehicle type?

Solve:
- Fit linear regression
- Check residuals, multicollinearity
- Interpret impact of variables (e.g., luxury vehicles = higher payout)
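
A minimal sketch of the fit-and-check-residuals steps for the one-predictor case, using the closed-form least-squares formulas. The vehicle values and claim amounts are hypothetical; with several predictors you would use a regression library and also check multicollinearity.

```python
def fit_simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x, plus residuals."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return slope, intercept, residuals

# Hypothetical data: vehicle value vs. claim amount (both in thousands)
vehicle_value = [10.0, 20.0, 30.0, 40.0]
claim_amount = [1.2, 1.9, 3.1, 3.8]
slope, intercept, residuals = fit_simple_ols(vehicle_value, claim_amount)
print(f"claim ≈ {intercept:.2f} + {slope:.2f} × vehicle value")
```

The slope reads as "each extra unit of vehicle value is associated with `slope` more in claims"; residuals that show a pattern rather than random scatter suggest the linear form is wrong.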

Use Case 2: Evaluate whether offering a telematics discount reduces claims.

Model: Difference-in-Differences

Question: Did claims drop more among those who opted into telematics vs. those who didn’t?

Solve:
- Use pre/post data with treatment/control groups
- Fit DiD model
- Assess interaction term
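
In the simplest two-group, two-period setting, the DiD interaction term equals the change in the treated group minus the change in the control group. A minimal sketch with invented claim counts (the function name and data are assumptions for illustration):

```python
from statistics import mean

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences: treated change minus control change."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical claims per 100 policies, before and after the telematics discount
effect = did_estimate(
    treated_pre=[10, 12, 11], treated_post=[7, 8, 9],
    control_pre=[10, 11, 12], control_post=[10, 10, 10],
)
print(effect)  # negative value = claims fell more in the telematics group
```

The estimate is only credible under the parallel-trends assumption: without the discount, both groups would have followed the same trend.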


🧠 Step 3: Decision Framework for Analytic Approach
- Let’s map questions → analysis type → data mining algorithm:

📊 **1. Descriptive Analytics**

Goal: Summarize patterns or trends in historical data.

Ask:
- What are key metrics over time?
- What are the distributions of variables?
- What are the common customer segments?

Approaches:
- Summary statistics, cross-tabs
- Behavioral segmentation
- Time Series Decomposition

Algorithms/Tools:
- Profiling
- Clustering (e.g., K-Means)
- Data Reduction (PCA, t-SNE)

🧪 **2. Inferential Statistics & Hypothesis Testing**

Goal: Draw conclusions about a population from a sample.

Ask:
- Is the difference between groups statistically significant?
- What’s the probability this result occurred by chance?

Approaches:
- Hypothesis Testing (t-test, ANOVA, chi-square)
- Confidence Intervals
- Probability distributions (Binomial, Normal, Poisson)
- A/B Testing / Controlled Experiments

Algorithms:
- No ML; statistical techniques suffice
- Difference-in-Differences (for observational causal studies)

🧭 **3. Causal Analysis**

Goal: Understand if X causes Y.

Ask:
- If we change X, will Y change?
- Can we estimate the treatment effect?

Approaches:
- Randomized Controlled Trials (RCT)
- Causal Inference (DoWhy, EconML)
- Propensity Score Matching
- Instrumental Variables

Algorithms:
- Causal Modeling (Graphical Models)
- Link Prediction (in networked systems)

📈 **4. Predictive Analytics**

Goal: Predict future events based on historical data.

Ask:
- Can we forecast a value or classify an outcome?
- Do we need to predict churn, fraud, claims, readmissions?

Approaches:
- Supervised Machine Learning
- Time Series Forecasting (if time-related)
- Behavioral or Risk Scoring

Algorithms:
- Classification: Logistic regression, Random Forest, XGBoost
- Regression: Linear regression, Lasso, SVR
- Similarity Matching: KNN, cosine similarity
- Time Series: ARIMA, Prophet, LSTM (for deep learning)

⏳ **5. Time Series Decomposition / Forecasting**

Goal: Identify seasonality, trend, and irregular patterns.

Ask:
- How does this variable behave over time?
- Are there predictable patterns?

Approaches:
- Time series decomposition (Trend/Seasonality/Noise)
- Forecasting (univariate or multivariate)

Algorithms:
- ARIMA/SARIMA
- Exponential Smoothing
- Prophet
- LSTM (deep learning)
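
As a small illustration of the simplest method in this family, the sketch below implements simple exponential smoothing: each smoothed value is a weighted average of the new observation and the previous smoothed value. The series and `alpha` are arbitrary example values; this variant handles neither trend nor seasonality.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed series; the last value is the one-step-ahead forecast."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # initialize with the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

smoothed = simple_exponential_smoothing([10.0, 12.0, 14.0, 16.0], alpha=0.5)
print(smoothed)
```

A larger `alpha` reacts faster to recent changes; a smaller one smooths more aggressively.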

💰 **6. Marketing Effectiveness / MMM**

Goal: Quantify how marketing inputs drive sales or ROI.

Ask:
- What’s the ROI of each channel?
- What happens if I increase digital ad spend?

Approaches:
- Marketing Mix Modeling (MMM)
- Attribution Modeling
- Regression with lag effects and diminishing returns

Algorithms:
- Linear and non-linear regression
- Bayesian MMM models
- Regularized regression (Ridge, Lasso for variable selection)

🧑‍🤝‍🧑 **7. Behavioral Segmentation**

Goal: Group individuals/customers by shared behaviors.

Ask:
- Can we segment our population meaningfully?
- Are there subgroups that behave differently?

Approaches:
- Clustering + Profiling
- Latent Class Analysis
- Customer Lifetime Value modeling

Algorithms:
- K-Means
- DBSCAN
- Gaussian Mixture Models (GMM)
- Hierarchical Clustering

🔗 **8. Link Prediction & Networks**

Goal: Predict new connections/interactions in networks.

Ask:
- Who will transact with whom?
- Can we detect fraud rings or referral networks?

Approaches:
- Graph embeddings
- Social network analysis
- Knowledge graphs

Algorithms:
- Node2Vec, DeepWalk
- Common Neighbors, Jaccard Similarity
- Graph Neural Networks

🔄 Mapping to Data Mining Algorithms

|Business Question Type|	Data Mining Task|	Example Tools /Algorithms|
|--------------------------|-------------------|----------------------------|
|Is this A or B? (classification)|	Classification|	Logistic Regression, XGBoost|
|How much/many? (numeric prediction)|	Regression|	Linear Regression, Lasso|
|Who/what is similar to this?|	Similarity Matching|	KNN, Cosine Similarity|
|What groups exist in my data?|	Clustering|	K-Means, DBSCAN|
|What items occur together?|	Co-occurrence Grouping|	Apriori, FP-Growth|
|What does a typical X look like?|	Profiling|	Summary Stats, Decision Trees|
|Will these entities connect?|	Link Prediction|	Graph Embeddings, Node2Vec|
|Can I reduce noise or simplify features?|	Data Reduction|	PCA, Autoencoders|
|Does X cause Y?|	Causal Modeling|	DoWhy, EconML, PSM, DiD|

## Knowing when and why to use each data mining algorithm
- Definition and Use Cases
- Business Questions That Lead You to It
- How to Choose the Right Algorithm

A key skill for a senior data scientist is not just knowing what each algorithm does, but asking the right questions to choose the right one. As a Data Science Lead, I often walk stakeholders through a simple process:

"What is the goal? Is it to **predict, explain, explore, or reduce**? What type of data and labels do we have?"

**1. Classification**

🔹 What it is:
- Supervised learning where the outcome is categorical (e.g., “Yes/No”, “High Risk/Low Risk”).

💼 When to Use:
- Fraud detection (fraudulent vs. non-fraudulent)
- Disease diagnosis (cancer/no cancer)
- Credit scoring (approve/decline loan)
- Churn prediction (will customer leave or not)

❓ Key Questions to Ask:
- Do we have labeled data with known outcomes?
- Is the outcome variable categorical?
- Are we trying to assign new records to known groups?

✅ Common Algorithms:
- Logistic Regression
- Decision Trees / Random Forest
- Gradient Boosting (XGBoost)
- Naive Bayes
- Neural Networks (for complex cases)

**2. Regression**

🔹 What it is:
- Supervised learning where the outcome is continuous (e.g., dollar value, time, rate).

💼 When to Use:
- Forecasting revenue, claims amount, or hospital stay length
- Estimating patient recovery time
- Predicting insurance premium costs

❓ Key Questions to Ask:
- Are we predicting a continuous number?
- Do we want to quantify the influence of each feature?
- Are we dealing with time trends?

✅ Common Algorithms:
- Linear Regression
- Lasso / Ridge
- Gradient Boosting Regression
- Time Series Regression (ARIMA/SARIMA)

**3. Clustering**

🔹 What it is:
Unsupervised learning for grouping similar items without predefined labels.

💼 When to Use:
- Customer segmentation
- Patient stratification (e.g., grouping by health profiles)
- Grouping similar insurance claims or policies

❓ Key Questions to Ask:
- Do we want to group or segment the data without labels?
- Are we exploring patterns rather than predicting outcomes?
- Do we want to understand common customer/patient types?

✅ Common Algorithms:
- K-Means
- Hierarchical Clustering
- DBSCAN
- Gaussian Mixture Models (GMM)
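
To show what K-Means does under the hood, here is a bare-bones sketch of Lloyd's algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its members. The initialization (first `k` points) and the data are simplifications chosen for the example; library implementations use smarter initialization and convergence checks.

```python
import math

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; returns cluster labels and centroids."""
    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    labels = []
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        labels = []
        # Assignment step: nearest centroid by Euclidean distance
        for p in points:
            idx = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[idx].append(p)
            labels.append(idx)
        # Update step: move each centroid to the mean of its members
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

# Hypothetical 2-D customer features forming two obvious groups
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```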

**4. Similarity Matching**

🔹 What it is:
Finding things similar to a given entity based on features.

💼 When to Use:
- Recommending similar products, policies, or treatments
- Finding patient “twins” for treatment comparison
- Matching insurance applicants to prior similar claims

❓ Key Questions to Ask:
- Do we want to recommend or match based on similarity?
- Do we have a reference case or query item?
- Are we comparing distances/similarities?

✅ Common Techniques:
- K-Nearest Neighbors (KNN)
- Cosine Similarity
- Euclidean/Manhattan Distance
- Embedding-based similarity (in NLP or deep learning)
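
A minimal sketch of similarity matching with cosine similarity: score every candidate against a query vector and return the closest ones. The patient feature vectors and the `most_similar` helper are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(query, candidates, top_n=1):
    """Rank candidate names by cosine similarity to the query vector."""
    ranked = sorted(candidates.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_n]]

# Hypothetical patient feature vectors
candidates = {
    "patient_a": [1.0, 0.0, 1.0],
    "patient_b": [0.0, 1.0, 0.0],
    "patient_c": [2.0, 0.0, 2.1],
}
matches = most_similar([1.0, 0.0, 1.0], candidates, top_n=2)
print(matches)
```

Cosine similarity ignores vector length, so it suits cases where the direction of the feature profile matters more than its magnitude.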

**5. Co-Occurrence Grouping (Association Rule Mining)**

🔹 What it is:
Unsupervised learning that finds items that frequently occur together.

💼 When to Use:
- Market basket analysis
- Diagnoses that tend to co-occur (comorbidity)
- Cross-selling insurance products
- Service bundling in healthcare or financial services

❓ Key Questions to Ask:
- Are we interested in items or events that occur together?
- Are we analyzing transactional or event-based data?
- Do we want to discover patterns or rules?

✅ Common Algorithms:
- Apriori Algorithm
- FP-Growth
- Eclat Algorithm
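
The algorithms above search for rules efficiently, but the quality of any single rule comes down to three numbers: support, confidence, and lift. A small sketch that computes them for one rule over invented insurance baskets (names and data are assumptions for the example):

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    count_a = sum(antecedent <= t for t in transactions)
    count_c = sum(consequent <= t for t in transactions)
    count_both = sum((antecedent | consequent) <= t for t in transactions)
    support = count_both / n                 # how often both occur together
    confidence = count_both / count_a        # P(consequent | antecedent)
    lift = confidence / (count_c / n)        # vs. baseline P(consequent)
    return support, confidence, lift

# Hypothetical product holdings per customer
baskets = [
    {"auto", "home"}, {"auto", "home", "life"}, {"auto"},
    {"home", "life"}, {"auto", "home"},
]
support, confidence, lift = rule_metrics(baskets, {"auto"}, {"home"})
print(support, confidence, lift)
```

A lift above 1 suggests the items co-occur more often than chance; below 1, less often.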

**6. Profiling (Behavior Description)**

🔹 What it is:
Describing the typical behavior of a group or class.

💼 When to Use:
- Customer lifetime value profiling
- Describing characteristics of high-risk patients
- Profiling fraudulent claims

❓ Key Questions to Ask:
- Do we want to describe and understand the typical characteristics?
- Are we analyzing groups rather than predicting?
- Are we building a business persona or behavior archetype?

✅ Common Techniques:
- Descriptive statistics
- Clustering + Summary Statistics
- Decision Trees (to explain differences between groups)

**7. Link Prediction**

🔹 What it is:
Predicting whether a relationship will form between two entities in a network.

💼 When to Use:
- Social network formation
- Identifying patient-doctor network changes
- Fraud rings in insurance or financial networks
- Recommending connections (doctors, specialists, business clients)

❓ Key Questions to Ask:
- Are we working with a network/graph of entities?
- Are we predicting future connections or interactions?
- Do we want to understand structural patterns?

✅ Common Algorithms:
- Graph Embeddings
- Node2Vec / DeepWalk
- Link-based classifiers (Common Neighbors, Jaccard, etc.)
- Knowledge Graphs
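
The neighborhood-based scores are simple enough to sketch directly. Below, a graph is stored as a dictionary of neighbor sets, and every currently unconnected pair is ranked by its Jaccard score. The toy graph and function names are assumptions for illustration.

```python
from itertools import combinations

def common_neighbors(graph, u, v):
    """Nodes adjacent to both u and v."""
    return graph[u] & graph[v]

def jaccard_score(graph, u, v):
    """Shared neighbors divided by all neighbors of either node."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0

def rank_candidate_links(graph):
    """Unconnected node pairs, most likely link first (by Jaccard score)."""
    candidates = [(u, v) for u, v in combinations(sorted(graph), 2)
                  if v not in graph[u]]
    return sorted(candidates,
                  key=lambda pair: jaccard_score(graph, *pair), reverse=True)

# Hypothetical undirected network stored as neighbor sets
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
ranking = rank_candidate_links(graph)
print(ranking)
```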

**8. Data Reduction**

🔹 What it is:
Reducing data size or dimensions while retaining important patterns.

💼 When to Use:
- Simplifying large datasets for visualization or modeling
- Speeding up computation
- Removing redundant features (e.g., in medical sensors, finance logs)

❓ Key Questions to Ask:
- Do we want to reduce complexity without losing insight?
- Is our data high-dimensional?
- Are we aiming to compress, filter, or summarize?

✅ Common Techniques:
- Principal Component Analysis (PCA)
- Autoencoders (deep learning)
- Feature selection / Variance Thresholding
- t-SNE, UMAP (for visualization)
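
Variance thresholding is the simplest of these techniques: drop any feature whose values barely change, since it carries little information. A minimal sketch, with a made-up dataset and threshold:

```python
from statistics import pvariance

def variance_threshold(rows, threshold=0.0):
    """Keep only the columns whose population variance exceeds the threshold."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return reduced, keep

# Hypothetical data: column 1 is constant, column 2 is nearly constant
rows = [
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 0.1],
    [3.0, 5.0, 0.0],
    [4.0, 5.0, 0.1],
]
reduced, kept_columns = variance_threshold(rows, threshold=0.01)
print(kept_columns, reduced)
```

Because variance depends on scale, features should be on comparable units before a single threshold is applied.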

**9. Causal Modeling**

🔹 What it is:
Uncovering causal relationships (not just correlations) between variables.

💼 When to Use:
- Determining treatment effects in healthcare (e.g., Drug A vs Drug B)
- Impact of marketing on conversion (e.g., campaign A vs. B)
- Policy effectiveness in financial or insurance interventions

❓ Key Questions to Ask:
- Do we want to understand the impact of one variable on another?
- Is correlation not enough—do we need causal inference?
- Is there an intervention or treatment whose effect needs quantifying?

✅ Common Techniques:
- Randomized Controlled Trials (RCT)
- Difference-in-Differences (DiD)
- Propensity Score Matching
- Instrumental Variables
- Causal Bayesian Networks
- DoWhy / EconML (causal libraries in Python)

Summary Table

|Data Mining Task|	Supervised?|	Output Type|	Best For|
|----------------|---------------|------------------|-----------------|
|Classification|	✅ Yes|	Categorical|	Labeling categories, decision-making|
|Regression|	✅ Yes|	Continuous|	Predicting quantities (costs, time, prices)|
|Clustering|	❌ No|	Group Labels|	Segmenting customers, patients|
|Similarity Matching|	❌ No|	Ranking/Match|	Recommendations, nearest neighbors|
|Co-occurrence Grouping|	❌ No|	Rules|	Bundling, cross-sell, comorbidity patterns|
|Profiling|	❌ No|	Descriptive|	Group behavior analysis|
|Link Prediction| ✅/❌ Mix| Binary/Score| Predicting new connections in networks|
|Data Reduction| ❌ No| Compressed Data| Preprocessing, dimensionality reduction|
|Causal Modeling|	✅ Yes|	Effect Estimate|	Understanding "what if" and estimating causal relationships|

### 📊 Data Mining vs. Use of Data Mining Results

|Aspect | Data Mining | Use of Data Mining Results|
|------------------------|--------------------------|-------------------------------|
|Definition | The process of discovering patterns, relationships, and insights from large datasets using algorithms | The application of those discovered patterns to drive decisions, strategies, or actions|
|Goal | To uncover hidden patterns, correlations, or trends within the data | To apply those findings to solve business problems or make data-informed decisions|
|Focus | Exploration and model building | Operationalization and action|
|Examples of Tasks | Classification, Clustering, Regression, Association Rule Mining, Anomaly Detection | Targeting high-value customers, Detecting fraud in real-time, Optimizing pricing|
|Techniques Used | Machine learning algorithms, statistical models, and pattern recognition | Dashboards, business rules, decision support systems, campaign tools|
|Who Performs It? | Data scientists, analysts, ML engineers | Business stakeholders, product teams, marketing, finance, and operations teams|
|Timeframe | Often exploratory and iterative | Often ongoing, used for monitoring or acting on predictions|
|Output | Models, patterns, clusters, predictive rules | Business insights, decisions, automation, customer actions|
|Example (Finance) | Using clustering to segment loan applicants | Targeting low-risk segments with preferred interest rates|
|Example (Healthcare) | Mining EHR data for disease progression patterns | Adjusting treatment plans or screening protocols based on insights|
|Example (Insurance) | Using regression to predict claims costs | Adjusting premium pricing models based on predicted risk|

Summary:
- Data Mining is the "discovery phase"—you dig into the data to uncover what's going on.
- Using Data Mining Results is the "action phase"—you take the insights and apply them to make decisions or build products.



## Statistical and Machine Learning algorithms categorization 
Algorithms can be categorized by criteria such as:
- learning type
- assumptions about data
- problem-solving approach

**Learning approach and methods**

| Algorithm  | Parametric | Non-Parametric | Lazy learning | Eager learning | Probabilistic Models | Distance-Based Models | Rule-Based Models | Ensemble Methods | Linear Models | Non-Linear Models | Traditional Time-Series Models | Machine Learning for Time-Series | Feature Engineering Required | Automatic Feature Extraction |
|-------|------|-------|-------|-------|------|-------|-------|-------|------|-------|-------|-------|------|-------|
|Linear Regression | | | | | | | | | | | | | | |
|Ridge Regression | | | | | | | | | | | | | | |
|Lasso Regression | | | | | | | | | | | | | | |
|Polynomial Regression | | | | | | | | | | | | | | |
|Poisson Regression | | | | | | | | | | | | | | |
|Gamma Regression | | | | | | | | | | | | | | |
|Bayesian Linear Regression | | | | | | | | | | | | | | |
|Support Vector Regression (SVR) | | | | | | | | | | | | | | |
|Decision Tree Regression | | | | | | | | | | | | | | |
|Random Forest Regression | | | | | | | | | | | | | | |
|XGBoost Regression | | | | | | | | | | | | | | |
|Neural Networks | | | | | | | | | | | | | | |
|Logistic Regression | | | | | | | | | | | | | | |
|Naïve Bayes | | | | | | | | | | | | | | |
|k-Nearest Neighbors | | | | | | | | | | | | | | |
|Decision Trees | | | | | | | | | | | | | | |
|Random Forest | | | | | | | | | | | | | | |
|Bagging (Bootstrap Aggregation) | | | | | | | | | | | | | | |
|Boosting (AdaBoost, Gradient Boosting, XGBoost) | | | | | | | | | | | | | | |
|Stacking | | | | | | | | | | | | | | |
|Gradient Boosting Machines (GBM) | | | | | | | | | | | | | | |
|XGBoost | | | | | | | | | | | | | | |
|Support Vector Machines (SVM) | | | | | | | | | | | | | | |
|Bayesian Networks | | | | | | | | | | | | | | |
|Artificial Neural Networks (ANNs) | | | | | | | | | | | | | | |
|Convolutional Neural Networks (CNNs) | | | | | | | | | | | | | | |
|Recurrent Neural Networks (RNNs) | | | | | | | | | | | | | | |
|Transformers | | | | | | | | | | | | | | |
|k-Means Clustering | | | | | | | | | | | | | | |
|Hierarchical Clustering | | | | | | | | | | | | | | |
|DBSCAN | | | | | | | | | | | | | | |
|Gaussian Mixture Models (GMM) | | | | | | | | | | | | | | |
|Mean-Shift Clustering | | | | | | | | | | | | | | |
|Self-Organizing Maps | | | | | | | | | | | | | | |
|Principal Component Analysis (PCA) | | | | | | | | | | | | | | |
|Independent Component Analysis (ICA) | | | | | | | | | | | | | | |
|t-Distributed Stochastic Neighbor Embedding (t-SNE) | | | | | | | | | | | | | | |
|Uniform Manifold Approximation and Projection (UMAP) | | | | | | | | | | | | | | |
|Autoencoders | | | | | | | | | | | | | | |
|Apriori Algorithm | | | | | | | | | | | | | | |
|Eclat Algorithm | | | | | | | | | | | | | | |
|FP-Growth Algorithm | | | | | | | | | | | | | | |
|Self-training Models  | | | | | | | | | | | | | | |
|Label Propagation | | | | | | | | | | | | | | |
|Generative Adversarial Networks (GANs)  | | | | | | | | | | | | | | |
|Q-Learning (Model-Free)  | | | | | | | | | | | | | | |
|Deep Q-Networks (DQN)  | | | | | | | | | | | | | | |
|Policy Gradient Methods (e.g., REINFORCE)  | | | | | | | | | | | | | | |
|Actor-Critic Algorithms (e.g., A3C, PPO)  | | | | | | | | | | | | | | |
|Monte Carlo Methods | | | | | | | | | | | | | | |
|Temporal Difference (TD) Learning  | | | | | | | | | | | | | | |
|Locally Weighted Regression | | | | | | | | | | | | | | |
|Hidden Markov Models (HMM) | | | | | | | | | | | | | | |
|ARIMA (AutoRegressive Integrated Moving Average) | | | | | | | | | | | | | | |
|SARIMA (Seasonal ARIMA)  | | | | | | | | | | | | | | |
|VAR (Vector AutoRegression)  | | | | | | | | | | | | | | |
|Exponential Smoothing  | | | | | | | | | | | | | | |
|LSTM (Long Short-Term Memory Networks)  | | | | | | | | | | | | | | |
|GRU (Gated Recurrent Units)  | | | | | | | | | | | | | | |
|Transformer Models for Time-Series (e.g., Time-Series Transformers)  | | | | | | | | | | | | | | |


### Assumptions About Data: Parametric vs. Non-Parametric Methods
##### **Parametric Methods**
Definition: Parametric methods make strong assumptions about the underlying distribution of the data. They assume that the data follows a specific form (e.g., Gaussian distribution) and the model is defined by a fixed number of parameters.
- Parametric models assume a fixed functional form (mathematical equation) that describes the relationship between input features and the output.

🔹 Key Characteristics:
- Assumes the data follows a specific distributional form (e.g., normally distributed errors in linear regression).
- The model is defined by a set of parameters (e.g., weights in linear regression).
- Once trained, the number of parameters is fixed, regardless of additional data.
- Computationally efficient since fewer parameters need to be estimated.

🔹 Examples:
- Linear Regression: Assumes a linear relationship between input features and output
- Logistic Regression: Assumes a sigmoid function for binary classification.
- Naïve Bayes: Applies Bayes' Theorem under the assumption that features are conditionally independent given the class.
- Gaussian Mixture Models (GMM): Assumes data comes from a mixture of Gaussian distributions.
- Linear Discriminant Analysis (LDA).

Advantages:
- Simpler and faster to train because they rely on a fixed number of parameters.
- Require less data to train effectively.

Disadvantages:
- If the assumptions about the data distribution are incorrect, the model may perform poorly.
- Less flexible in capturing complex patterns in the data.

🔹 Limitations:
- Misspecification risk: If the true relationship is non-linear, a parametric model might perform poorly.
- Lack of flexibility: Assumptions may not hold for real-world data.

##### **Non-Parametric Methods**
Definition: Non-parametric methods do not make strong assumptions about the data distribution. They are more flexible and can adapt to the structure of the data.
- Non-parametric models do not assume a fixed functional form. Instead, they infer structure from data.

🔹 Key Characteristics:
- More flexible because they do not assume a predefined equation.
- The model complexity grows with data size (e.g., decision trees expand as data increases).
- They can capture complex patterns that parametric models miss.

🔹 Examples:
- Decision Trees: Splits data into branches based on feature values.
- k-Nearest Neighbors (k-NN): Memorizes all training data and makes predictions based on the closest examples.
- Support Vector Machines (SVM, with non-linear kernels): Uses kernel functions to map data into a higher-dimensional space.
- Random Forest: An ensemble of multiple decision trees.

Advantages:
- More flexible and can model complex relationships in the data.
- No need to assume a specific form for the data distribution.

Disadvantages:
- Require more data to train effectively.
- Computationally more expensive and slower to train.

🔹 Limitations:
- Higher computational cost due to storing training data.
- May require more data for good performance compared to parametric models.

Parametric vs Non-Parametric
|Feature	|Parametric	|Non-Parametric |
|-----------|-----------|---------------|
|Assumptions	|Strong assumptions about data distribution	| Minimal assumptions |
|Model Complexity	|Fixed number of parameters	|Flexible, grows with data |
|Example	|Linear Regression	|k-NN, Decision Trees |
|Storage	|No training data stored	|May store training data|


### Learning Approach: Lazy Learning vs. Eager Learning
##### **Lazy Learning (Instance-Based Learning)**
Definition: Lazy learning methods delay the processing of the training data until a prediction is required. The model simply stores the training data and uses it directly to make predictions.
- Lazy learning algorithms do not learn a model during training. Instead, they store the training data and defer computation until a prediction is required.

🔹 Key Characteristics:
- No explicit model is built during the training phase; the data is simply stored.
- Predictions are made by comparing new instances to the stored training instances.
- Fast to train but computationally expensive at prediction time, so it works well for small datasets and is slow for large ones.

🔹 Examples:
- k-Nearest Neighbors (k-NN): Stores training data and finds the closest k points at prediction time.
- Case-Based Reasoning (CBR): Solves problems by referring to past cases.

🔹 Advantages:
- No need for assumptions about data distribution.
- Adapts well to new data.
- No loss of information during training.

🔹 Disadvantages:
- Computationally expensive at prediction time.
- Memory-intensive, since the entire training dataset must be stored.

##### **Eager Learning (Model-Based Learning)**
Definition: Eager learning methods build a generalized model during training, which is then used to make predictions on new data.
- Eager learning algorithms build a model during training and use it for predictions.

🔹 Key Characteristics:
- Training phase involves learning parameters.
    - A model is constructed during training, and predictions are made using this model.
- Predictions are fast as the model is pre-built.
    - Faster prediction times compared to lazy learning.

🔹 Examples:
- Linear Regression: Learns a fixed set of weights.
- Decision Trees: Constructs a tree structure from training data.
- Neural Networks: Learns representations via backpropagation.

🔹 Advantages:
- Fast inference once trained.
- Can generalize well if trained correctly.
- More efficient for large datasets.

🔹 Disadvantages:
- Training can be computationally expensive.
- May require hyperparameter tuning.
- Less adaptable to new data.
- May lose information during model construction.

Lazy Learning vs Eager Learning
|Feature	|Lazy Learning|	Eager Learning|
|-----------|-------------|---------------|
|Training Effort|	Minimal|	Heavy |
|Prediction Time|	Expensive|	Fast |
|Example|	k-NN|	Decision Trees|
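
The contrast can be sketched with two toy classifiers. `LazyKNN` does no work in `fit` beyond storing the data and pays the cost at prediction time; `EagerCentroid` compresses the training data into one centroid per class up front (a nearest-centroid classifier, used here as a stand-in for any eager model). Class names and data are assumptions for illustration.

```python
import math
from collections import Counter

class LazyKNN:
    """Lazy learner: fit() only stores the data; predict() does the work."""
    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, x, k=3):
        # Scan all stored points at prediction time
        nearest = sorted(range(len(self.X)),
                         key=lambda i: math.dist(x, self.X[i]))[:k]
        return Counter(self.y[i] for i in nearest).most_common(1)[0][0]

class EagerCentroid:
    """Eager learner: fit() builds a compact model (one centroid per class)."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            members = [p for p, lab in zip(X, y) if lab == label]
            self.centroids[label] = [sum(c) / len(members) for c in zip(*members)]
        return self

    def predict(self, x):
        # Only the centroids are consulted; training data is not needed
        return min(self.centroids,
                   key=lambda label: math.dist(x, self.centroids[label]))

X = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y = ["low", "low", "low", "high", "high", "high"]
lazy = LazyKNN().fit(X, y)
eager = EagerCentroid().fit(X, y)
print(lazy.predict((2, 2)), eager.predict((7, 7)))
```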

### Problem-Solving Approach
##### **Probabilistic Models**
Definition: These models use probability theory to model uncertainty and make predictions. They often assume that the data is generated from a probabilistic process.
- Use probability distributions to model relationships.

Examples:
- Naïve Bayes
- Gaussian Mixture Models (GMMs)
- Hidden Markov Models (HMM)
- Bayesian Networks

Use Cases:
- Spam detection
- Speech recognition
- Medical diagnosis

Probabilistic vs Non-Probabilistic
|Feature	|Probabilistic	|Non-Probabilistic |
|-----------|-----------|-------------|
|Approach	|Models uncertainty with probability distributions|	Directly optimizes a decision function or splitting rule|
|Example	|Naïve Bayes |	Decision Trees|
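
A worked instance of the probabilistic reasoning behind these models, using Bayes' theorem on a medical-diagnosis example. The prevalence and test accuracy figures are invented for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    # Total probability of a positive test (true positives + false positives)
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Hypothetical: 1% prevalence, 90% sensitivity, 5% false-positive rate
p_disease = posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(f"P(disease | positive) = {p_disease:.3f}")
```

Even with a fairly accurate test, the posterior stays low when the condition is rare, which is why probabilistic models carry the prior explicitly.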

##### **Distance-Based Models**
Definition: These models rely on measuring the similarity or distance between data points to make predictions.
- Measure similarity or distance between data points.

Examples:
- k-NN
- k-Means Clustering

Use Cases:
- Recommendation systems
- Clustering
- Anomaly detection

##### **Rule-Based Models**
Definition: These models use a set of rules (often in the form of if-then statements) to make decisions or predictions.
- Use if-else conditions to classify data.

Examples:
- Decision Trees
- Association Rule Mining
- Rule-Based Systems
- Fuzzy Logic

Use Cases:
- Expert systems
- Classification tasks

##### **Ensemble Methods**
Definition: Ensemble methods combine multiple models to improve performance. They aim to reduce overfitting and increase generalization.
- Combine multiple models to improve accuracy.

Examples:
- Random Forest (Bagging)
- XGBoost (Boosting)
- Gradient Boosting Machines (GBM)
- AdaBoost

Use Cases:
- Predictive modeling
- Classification
- Regression

Bagging vs Boosting (both are ensemble methods)
|Feature	|Bagging	| Boosting |
|-----------|-------------------|------------------|
|Core Idea	|Multiple models trained independently on bootstrap samples	|Models trained sequentially, each correcting the errors of its predecessors|
|Example	|Random Forest	|XGBoost |
|Error Handling	|Mainly reduces variance	|Mainly reduces bias |
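
A rough sketch of the bagging idea only (boosting is not shown): train the same high-variance base learner on many bootstrap resamples and average the predictions. The base learner here is a 1-nearest-neighbor regressor, chosen purely because it is short to write; the data, seed, and function names are assumptions for the example.

```python
import random
from statistics import mean

def bootstrap_sample(data, rng):
    """Draw len(data) points with replacement."""
    return [rng.choice(data) for _ in data]

def one_nn_predict(sample, x_new):
    """High-variance base learner: y of the closest x in the sample."""
    return min(sample, key=lambda pair: abs(pair[0] - x_new))[1]

def bagged_predict(data, x_new, n_models=25, seed=0):
    """Average the base learner over independent bootstrap resamples."""
    rng = random.Random(seed)
    predictions = [one_nn_predict(bootstrap_sample(data, rng), x_new)
                   for _ in range(n_models)]
    return mean(predictions), predictions

# Hypothetical (x, y) training pairs
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
bagged, individual = bagged_predict(data, x_new=3.2)
print(bagged, sorted(set(individual)))
```

Individual models disagree because each sees a different resample; averaging them smooths out that disagreement, which is the variance reduction bagging is known for.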


### Function Approximation
##### **Linear Models**
Definition: Linear models assume a linear relationship between the input features and the output.
- Assumes a linear relationship.

Examples:
- Linear Regression
- Logistic Regression

Advantages:
- Simple and interpretable.
- Computationally efficient.

Disadvantages:
- Limited ability to model complex relationships.

##### **Non-Linear Models**
Definition: Non-linear models can capture complex, non-linear relationships between input features and the output.
- Can capture complex relationships.

Examples:
- Decision Trees
- Neural Networks
- Support Vector Machines (SVM) with non-linear kernels

Advantages:
- Can model complex patterns in the data.

Disadvantages:
- Computationally more expensive.
- Harder to interpret.

### Time-Series & Sequential Data
##### **Traditional Time-Series Models**
Definition: These models are specifically designed for time-series data, where the order of data points matters.
- Many classical models assume stationarity (constant statistical properties over time); ARIMA and SARIMA reach it by differencing the series first.

Examples:
- ARIMA (AutoRegressive Integrated Moving Average)
- SARIMA
- Exponential Smoothing

Use Cases:
- Stock price prediction
- Weather forecasting
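
Differencing is the standard way to move a trending or seasonal series toward stationarity (it is the "Integrated" part of ARIMA). A minimal sketch with toy series:

```python
def difference(series, lag=1):
    """Subtract the value `lag` steps earlier from each observation."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# A linear trend becomes constant after first differencing
trend = [3, 5, 7, 9, 11]
print(difference(trend))  # [2, 2, 2, 2]

# A repeating 4-step pattern with drift becomes constant after seasonal differencing
seasonal = [1, 2, 3, 4, 2, 3, 4, 5]
print(difference(seasonal, lag=4))  # [1, 1, 1, 1]
```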

##### **Machine Learning for Time-Series**
Definition: Machine learning models can also be applied to time-series data, often by transforming the data into a format suitable for supervised learning.
- Do not formally require stationarity, although removing trend and seasonality often still helps.

Examples:
- LSTMs
- Transformers
- Gradient Boosting Machines (GBM)
- Recurrent Neural Networks (RNN)

Use Cases:
- Anomaly detection
- Sequence prediction

### Feature Learning
##### **Feature Engineering Required**
Definition: Feature engineering involves manually creating or selecting features from raw data to improve model performance.
- Manual selection of features.

Examples:
- Logistic Regression
- SVM

Typical steps:
- Creating interaction terms
- Scaling
- Encoding categorical variables
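
Each of these steps is only a few lines when written out by hand. The helpers below are illustrative sketches with invented data; in practice a preprocessing library would handle them.

```python
def standardize(values):
    """Scale to zero mean and unit (population) standard deviation."""
    mean_v = sum(values) / len(values)
    std = (sum((v - mean_v) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean_v) / std for v in values]

def one_hot(values):
    """Encode a categorical column as 0/1 indicator columns."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

def interaction(a, b):
    """Element-wise product of two numeric features."""
    return [x * y for x, y in zip(a, b)]

scaled = standardize([1.0, 2.0, 3.0])
encoded, categories = one_hot(["car", "suv", "car"])
crossed = interaction([1, 2], [3, 4])
print(scaled, encoded, categories, crossed)
```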

Advantages:
- Can significantly improve model performance.
- Allows domain knowledge to be incorporated.

Disadvantages:
- Time-consuming and requires expertise.

##### **Automatic Feature Extraction**
Definition: Automatic feature extraction involves using algorithms to automatically discover useful features from raw data.
- Uses algorithms to learn features automatically.

Examples:
- Neural Networks
- Autoencoders
- Principal Component Analysis (PCA)
- Convolutional Neural Networks (CNN)

Advantages:
- Reduces the need for manual feature engineering.
- Can discover complex patterns in the data.

Disadvantages:
- May be less interpretable.


### Learning Types
##### **Supervised Learning**
Definition: The model is trained on labeled data, where the input features and corresponding output labels are provided.
- Learns from labeled data.

Example:
- Regression Algorithms 
    - Regression: Linear Regression (Parametric)
        - Linear regression assumes a specific functional form
        - It assumes linearity in relationships.
        - It estimates a fixed number of parameters (coefficients β).
        - Once trained, it does not store training data; predictions are based on learned parameters.
    - Regression: Support Vector Regression (SVR) (Non-Parametric)
        - SVR does not assume a fixed functional form.
        - Uses a kernel trick to transform input features into a higher-dimensional space.
        - The complexity of the model depends on the dataset, as it finds support vectors rather than learning fixed parameters.
- Classification Algorithms
    - Classification: Decision Trees  (Non-Parametric)
        - It does not assume a mathematical relationship between inputs and outputs.
        - It splits data based on information gain or Gini index, dynamically adjusting tree structure.
    - Classification: Logistic Regression (Parametric)
        - Uses a fixed logistic (sigmoid) function
        - Assumes a fixed number of parameters.
        - Does not store training data explicitly.
    - Classification: k-Nearest Neighbors (k-NN) (Non-Parametric, Lazy Learning)
        - Does not assume any functional form of the data.
        - Instead of learning parameters, it memorizes training data.
        - Why it is Lazy Learning: It only performs computations when making a prediction, not during training.
    - Classification: Random Forest (Ensemble)
        - It combines multiple Decision Trees.
        - Uses Bootstrap Aggregation (Bagging) to improve performance.
    - Classification: XGBoost (Gradient Boosting)
        - Uses an iterative approach where each new tree corrects errors of previous trees.
        - Each tree is weighted by its contribution.
    - Classification: Naïve Bayes (Bayesian)
        - Uses Bayes’ Theorem to compute class probabilities.
        - Assumes feature independence
    - Classification: Neural Networks (Deep Learning)
        - Consists of multiple layers (input, hidden, output).
        - Each layer applies transformations to data using activation functions.
        - Learns from raw data features rather than requiring manual feature engineering.
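
To make the "lazy learning" description of k-NN above concrete, here is a small plain-Python sketch. It is only an illustration under simplifying assumptions (Euclidean distance, majority vote, no tie handling); the function name `knn_predict` and the toy points are made up for this example.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is just keeping the data; all the work happens here, at prediction time.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
prediction = knn_predict(points, labels, query=(1.1, 1.0), k=3)
print(prediction)
```

Note that no parameters are estimated anywhere: the model is the stored training data itself, which is what makes k-NN non-parametric.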

Supervised Learning Algorithms
- Used when the target variable is known (e.g., predicting fraud, classifying diseases, forecasting churn)

🔹 Classification (Categorical Outcome)
- Logistic Regression
- Decision Trees
- Random Forest
- Gradient Boosting (XGBoost, LightGBM, CatBoost)
- Support Vector Machines (SVM)
- Naive Bayes
- K-Nearest Neighbors (KNN)
- Neural Networks / Deep Learning (for classification)

🔹 Regression (Continuous Outcome)
- Linear Regression
- Ridge / Lasso / ElasticNet Regression
- Random Forest Regressor
- Gradient Boosting Regressor
- Support Vector Regression (SVR)
- Neural Networks / Deep Learning (for regression)

Use Cases: 
- Image classification, 
- spam detection.

Supervised Learning Subcategories
- Regression (Predict Continuous Values)
- Classification (Predict Discrete Labels)

##### **Unsupervised Learning**
Definition: The model is trained on unlabeled data, and the goal is to discover hidden patterns or structures.
- Learns patterns in unlabeled data.

Example: 
- Clustering Algorithms
    - Clustering: k-Means (Partitioning)
        - Divides data into k clusters based on distance metrics.
        - Iteratively updates cluster centroids.
    - Clustering: DBSCAN (Density-Based)
        - Groups points based on high-density regions.
        - Identifies noise and outliers.
    - Dimensionality Reduction: PCA

Use Cases: 
- Customer segmentation, 
- anomaly detection.
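
The two k-Means steps listed above (assign points to the nearest centroid, then update the centroids) can be sketched in a few lines. This is a simplified one-dimensional version with hand-picked starting centroids, intended only to show the loop; the name `kmeans_1d` and the toy values are assumptions for illustration.

```python
def kmeans_1d(values, centroids, n_iter=10):
    """Run the k-Means assign/update loop on 1-D data from the given start centroids."""
    centroids = list(centroids)
    for _ in range(n_iter):
        # Assignment step: each value joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: each centroid moves to the mean of its cluster
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]  # an empty cluster keeps its centroid
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical toy data with two obvious groups
final_centroids, final_clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centroids=[1, 2])
print(final_centroids, final_clusters)
```

In practice a library implementation with better initialisation would normally be used; the sketch only mirrors the iterative centroid update described in the list.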

Unsupervised Learning Algorithms
- Used when the target variable is unknown (e.g., customer segmentation, anomaly detection, topic modeling)

🔹 Clustering
- K-Means
- Hierarchical Clustering
- DBSCAN
- Gaussian Mixture Models (GMM)
- Spectral Clustering

🔹 Dimensionality Reduction / Data Compression
- Principal Component Analysis (PCA)
- t-SNE (t-distributed Stochastic Neighbor Embedding)
- UMAP (Uniform Manifold Approximation and Projection)
- Autoencoders (unsupervised deep learning)

🔹 Association / Co-occurrence Grouping
- Apriori Algorithm
- FP-Growth
- Market Basket Analysis

🔹 Anomaly Detection (Unsupervised)
- Isolation Forest
- One-Class SVM
- Elliptic Envelope
- Autoencoders (for anomaly scoring)

##### Supervised Learning vs. Unsupervised Learning

|Aspect | 🧠 Supervised Learning | 🧩 Unsupervised Learning|
|-----------|-------------------------------|---------------------------|
|Definition | Learns a mapping from input features to a known output | Learns patterns or structures from unlabeled data|
|Goal | Predict or classify labeled outcomes | Discover hidden patterns, groupings, or structure|
|Input Data | Features + known target (label) | Features only (no target/label)|
|Output | Predicted label or value | Clusters, dimensions, or associations|
|Examples of Problems | Fraud detection, disease classification, risk prediction | Customer segmentation, anomaly detection, topic modeling|
|Key Algorithms | Linear Regression, Logistic Regression, Decision Trees, Random Forest, XGBoost, SVM, Neural Networks | K-Means, Hierarchical Clustering, DBSCAN, PCA, t-SNE, Apriori, Autoencoders|
|Evaluation Metrics | Accuracy, Precision, Recall, AUC, RMSE, MAE | Silhouette Score, Inertia, Cluster Purity, Explained Variance|
|Requires Labeled Data? | ✅ Yes | ❌ No|
|Common Use Cases | Predicting loan defaults, diagnosing diseases, forecasting stock prices | Behavioral segmentation, market basket analysis, reducing dimensionality|
|Interpretability | Often more interpretable depending on the model | Depends on method, often more exploratory|
|Complexity | Can be high if overfitting/underfitting needs tuning | Often involves trial-and-error and visual validation|

**How to Choose?**

Ask:

1. Do I know what I’m trying to predict?
    - Yes → Supervised Learning
    - No → Unsupervised Learning
2. Do I have labeled outcome data?
    - Yes → Supervised
    - No → Unsupervised
3. Am I trying to understand structure or reduce dimensionality?
    - Yes → Unsupervised

##### **Semi-Supervised Learning**
Definition: Combines a small amount of labeled data with a large amount of unlabeled data to improve learning accuracy.
- Uses both labeled and unlabeled data.
- Used when only some data points have labels.

Example: 
- Self-training (use confident predictions to label new data).
- Label Propagation (propagate labels to similar data points).
- Co-training.

Use Cases: 
- Speech recognition, 
- web content classification.

##### **Reinforcement Learning**
Definition: The model learns by interacting with an environment and receiving feedback in the form of rewards or penalties.
- Learns through trial and error.

Example: 
- Q-Learning: Uses a Q-table to store action-reward values
- Deep Q-Networks (DQN): Uses a neural network to approximate Q-values.

Use Cases: 
- Game playing, 
- robotics.
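
As a rough illustration of how the Q-table in Q-Learning is filled in, the sketch below applies the standard tabular update Q(s, a) ← Q(s, a) + α · (r + γ · max Q(s', a') − Q(s, a)). The state names, the action names, and the values of `alpha` and `gamma` are arbitrary assumptions chosen for the example.

```python
from collections import defaultdict


def q_update(q_table, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q-value."""
    # Best value obtainable from the next state according to the current table
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]


q_table = defaultdict(float)   # unseen (state, action) pairs start at 0.0
actions = ["left", "right"]
first = q_update(q_table, "s0", "right", reward=1.0, next_state="s1", actions=actions)
second = q_update(q_table, "s0", "right", reward=1.0, next_state="s1", actions=actions)
print(first, second)
```

Repeating the same rewarded transition moves the stored value step by step towards the reward, which is the trial-and-error learning described above. A DQN replaces the table with a neural network that approximates these values.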

##### **Self-Supervised Learning**
Definition: A form of unsupervised learning where the data provides its own supervision by generating labels from the input data.

Examples: 
- Contrastive Learning, 
- Masked Language Models (e.g., BERT).

Use Cases: 
- Natural language processing, 
- computer vision.

##### **Transfer Learning**
Definition: A model trained on one task is reused as the starting point for a model on a second task.

Examples: 
- Fine-tuning pre-trained models (e.g., GPT, ResNet).

Use Cases: 
- Image recognition, 
- NLP tasks.

##### **Online Learning**
Definition: The model is updated continuously as new data arrives, rather than being trained on a static dataset.

Examples: 
- Stochastic Gradient Descent (SGD), 
- Online k-Means.

Use Cases: 
- Real-time recommendation systems, 
- fraud detection.
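
A minimal sketch of the online-learning idea: a one-feature linear model is updated by a single stochastic gradient step each time a new observation arrives, instead of being refit on a stored dataset. The simulated stream, the learning rate, and the function name `sgd_step` are assumptions made for illustration.

```python
def sgd_step(weight, bias, x, y, learning_rate=0.05):
    """Update a one-feature linear model from a single (x, y) observation."""
    error = (weight * x + bias) - y          # prediction error on this observation
    weight -= learning_rate * error * x      # gradient of 0.5 * error**2 w.r.t. weight
    bias -= learning_rate * error            # gradient of 0.5 * error**2 w.r.t. bias
    return weight, bias


# Simulated stream of observations that follow y = 2 * x
stream = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 2000

weight, bias = 0.0, 0.0
for x, y in stream:
    weight, bias = sgd_step(weight, bias, x, y)

print(round(weight, 3), round(bias, 3))
```

Only the current `weight` and `bias` are kept between observations, so memory use does not grow with the amount of data seen.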

##### **Active Learning**
Definition: The model actively queries the user or an oracle to label new data points that are most informative for improving the model.

Examples: 
- Uncertainty Sampling, 
- Query-by-Committee.

Use Cases: 
- Medical diagnosis, 
- rare event detection.
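
Uncertainty sampling can be sketched very simply for a binary classifier: the samples whose predicted probability is closest to 0.5 are the ones the model is least sure about, so they are sent to the oracle first. The probabilities below are invented for illustration, and `most_uncertain` is a hypothetical helper name.

```python
def most_uncertain(probabilities, n_queries=1):
    """Return indices of the samples whose predicted probability is closest to 0.5."""
    ranked = sorted(range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:n_queries]


# Hypothetical predicted probabilities of the positive class for five unlabeled samples
predicted = [0.95, 0.52, 0.10, 0.40, 0.99]
to_label = most_uncertain(predicted, n_queries=2)
print(to_label)
```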

##### Deep Learning vs. Traditional Machine Learning

|Feature |Deep Learning |Traditional ML |
|-----------|-------------|---------------|
|Feature Engineering | Automatic | Manual |
|Model Complexity | High | Low |
|Example | CNNs, RNNs | Logistic Regression, k-NN |

## Choosing the Right Statistical Model

The type of statistical model you use depends on your data and problem:

- Linear Regression: For predicting a **continuous target variable** based on one or more predictors.
- Logistic Regression: For predicting **binary outcomes**, often used in classification problems.
- ANOVA (Analysis of Variance): For comparing means across multiple groups.
- Time Series Models: For data that’s ordered by time (e.g., ARIMA, SARIMA).
- Survival Analysis: For time-to-event data, such as customer churn timing.
- Multivariate Analysis: For understanding interactions across multiple variables (e.g., MANOVA, PCA).

## Preprocessing the Data
Prepare your data by cleaning and preprocessing it:

- Missing Values: Decide whether to impute or drop missing values.
- Outliers: Identify and consider handling outliers, especially in regression.
- Data Transformation: Transform non-normal variables if required (e.g., using log transformations).
- Feature Scaling: For some models, standardizing or normalizing data is essential.


## Exploratory Data Analysis (EDA)

EDA is essential for understanding:
- patterns, through visualizations,
- distributions, through summary statistics,
- relationships, through correlation matrices.

It helps to identify relevant features and to spot potential issues such as multicollinearity.

## Building the Statistical Model

- **Statsmodels** provides 
    - coefficients, 
    - p-values, and 
    - confidence intervals for each variable, 
        - enabling hypothesis testing on whether each predictor significantly affects the outcome.

## Evaluating Model Performance
Regression Metrics: 
- Use R-squared, 
- Adjusted R-squared, 
- RMSE, and 
- MAE to evaluate regression models.
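
For reference, the four regression metrics listed above can be computed directly from their definitions. This is a plain-Python sketch on made-up numbers; the helper name `regression_metrics` and its `n_predictors` argument (the number of predictors, needed for Adjusted R-squared) are assumptions for the example.

```python
import math


def regression_metrics(y_true, y_pred, n_predictors):
    """Return R-squared, Adjusted R-squared, RMSE and MAE for a set of predictions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse, "mae": mae}


# Hypothetical actual and predicted values
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(actual, predicted, n_predictors=1)
print(metrics)
```

Adjusted R-squared is never larger than R-squared because it penalises the number of predictors, which is why it is preferred when comparing models of different size.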

Classification Metrics: 
- Use confusion matrix, 
- accuracy, 
- precision, 
- recall, and 
- AUC-ROC.
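
The confusion matrix is the starting point for accuracy, precision, and recall, so a short sketch of how they are derived from it may help. The labels below are invented, `classification_metrics` is a hypothetical helper, and AUC-ROC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: actual, columns: predicted
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical true and predicted labels
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
results = classification_metrics(y_true, y_pred)
print(results)
```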

Residual Analysis: 
- Residual plots help assess model assumptions:
    - homoscedasticity, 
    - normality of residuals.

## Model Interpretation
Statistical models are highly interpretable. 
- In linear regression, each coefficient represents the expected change in the dependent variable for a one-unit change in the predictor, holding all else constant.

Confidence Intervals: 
- Look at 95% CI for each coefficient; if it does not contain zero, it suggests the predictor has a statistically significant effect.

P-Values: 
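- A p-value below a chosen threshold (usually 0.05) indicates that the predictor's coefficient is statistically significantly different from zero, i.e. an estimate this large would be unlikely if the predictor had no effect.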

## Validating Assumptions
- Linearity: Check scatter plots of residuals.
- Normality of Residuals: Use a Q-Q plot to verify.
- No Multicollinearity: Variance inflation factor (VIF) helps detect multicollinearity.
- Homoscedasticity: Plot residuals vs. fitted values.

## Engaging with Senior Stakeholders
When you're talking to a senior stakeholder like the CFO, your focus should be on understanding the business context, desired outcomes, and constraints.

🧠 1. Understanding the Business Problem

- What is the core strategic objective you’re aiming to achieve?
    - (e.g., “Reduce fraudulent claims,” “Optimize operating costs by 10%,” “Improve cash flow predictability”)
- Why is this objective important now? What’s the urgency or driver?
    - (Regulatory change? Competitive pressure? Rising losses?)
- What does success look like in measurable terms?
    - (E.g., “Cut fraud losses by $2M/year” or “Reduce manual processing time by 30%”)
- Are there known bottlenecks, pain points, or suspicions we can explore with data?
    - (e.g., “Most fraud seems to happen in claims above $5,000”)

🔍 2. Scoping the Use Case

- Which part of the process are we focusing on first?
    - (Detection? Prevention? Recovery? A specific product or geography?)
- Do we already have KPIs or benchmarks for this area?
    - (That helps compare model performance later)
- What constraints or risks should we be aware of?
    - (Compliance, budget, timeline, staff resources, reputational concerns)

📊 3. About Data & Resources

- What internal systems or data sources can we tap into?
    - (Claims system, finance ledger, transaction logs, case audit reports)
- Who owns the data, and can we access it securely?
- Is there a team or SME (subject matter expert) we can work with for context?

💰 4. About Value & Prioritization

- Where do you believe the biggest value or impact lies?
    - (This guides prioritization—e.g., is it high-value fraud, or many small cases?)
- If we find something actionable, are there teams ready to act on it?
    - (E.g., adjust controls, launch an investigation, or redesign workflows)
- How frequently do you want updates, and what format do you prefer for insights?
    - (This sets stakeholder communication expectations: dashboard, report, meeting)

📣 5. Communication & Success Metrics

- How would you like to receive updates?
    - e.g., Executive summary, dashboard, workshop, email report
- How often would you prefer updates or reviews?
    - Weekly, biweekly, monthly?
- What metrics or indicators matter most to you when tracking success?
    - e.g., cost savings, detection rate, ROI, reduced processing time

6. Final / Conclusion Questions
- Who else should we speak with to understand the process or data better?
- Are there past initiatives (successful or failed) that we should be aware of?
- Are there strategic timelines or board deadlines we should align with?

## Reporting and Communicating Results
Present your findings by focusing on:

- Key Coefficients: Explain which predictors significantly affect the outcome.
- Model Fit: Interpret R-squared values (e.g., explaining how much variance in the target variable is explained).
- Real-World Implications: Describe how insights from the model can impact business decisions.

# Approach to statistical modeling

Each model type has specific 
- applications, 
- strengths, and 
- limitations, 

Understand when and how to use them.

### Step 1: Define Objectives and Hypotheses

Identify the Problem and Objectives: 
- Clearly define the goal.
    - Are you trying to predict, classify, find patterns, or estimate relationships? 
    - Setting objectives helps in choosing the right model.

- Formulate Hypotheses: 
    - Based on the problem, develop hypotheses. 
        - For instance, in a sales prediction problem, you may hypothesize that features such as advertising spend, time of year, and economic indicators affect sales.

### Step 2: Data Collection and Preprocessing
Data Collection: 
- Gather historical data related to the problem. 

Data Cleaning: 
- Handle missing values, remove duplicates, and ensure consistency.

Feature Engineering: 
- Create new features if necessary. 
- This could involve 
    - transformations, 
    - encoding categorical variables, or 
    - creating interaction terms.

Below is a full data cleaning pipeline in Python that automates common cleaning tasks using Pandas, NumPy, Scikit-learn, and imbalanced-learn. This pipeline includes handling:
- ✅ Missing values
- ✅ Duplicates
- ✅ Inconsistent data types
- ✅ Categorical inconsistencies
- ✅ Outliers
- ✅ Imbalanced data

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

def clean_data(df):
    """Automated Data Cleaning Pipeline"""
    
    # 1️⃣ Remove Duplicate Rows (copy to avoid chained-assignment warnings later)
    df = df.drop_duplicates().copy()
    
    # 2️⃣ Handle Missing Values
    num_cols = df.select_dtypes(include=np.number).columns
    cat_cols = df.select_dtypes(include="object").columns
    
    # Impute numerical missing values with median
    num_imputer = SimpleImputer(strategy='median')
    df[num_cols] = num_imputer.fit_transform(df[num_cols])

    # Impute categorical missing values with mode
    # (assign back instead of using inplace=True, which can silently act on a copy)
    for col in cat_cols:
        df[col] = df[col].fillna(df[col].mode()[0])
    
    # 3️⃣ Standardize Categorical Values (Lowercase, Remove Spaces)
    df[cat_cols] = df[cat_cols].apply(lambda x: x.str.lower().str.strip())

    # 4️⃣ Convert Data Types (Ensure numerical columns are numeric)
    for col in num_cols:
        df[col] = pd.to_numeric(df[col], errors='coerce')

    # 5️⃣ Handle Outliers (Using IQR method)
    for col in num_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df[col] = np.where(df[col] < lower_bound, lower_bound, df[col])
        df[col] = np.where(df[col] > upper_bound, upper_bound, df[col])

    # 6️⃣ Encode Categorical Variables
    for col in cat_cols:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])
    
    return df

# Example Usage:
df = pd.DataFrame({
    'Age': [25, np.nan, 30, 35, 200],  # Outlier in Age
    'Salary': [50000, np.nan, 60000, 65000, 70000],
    'City': ['New York', 'new york ', 'Los Angeles', np.nan, 'San Francisco'],
    'Gender': ['M', 'F', 'm', np.nan, 'F']
})

cleaned_df = clean_data(df)
print(cleaned_df)

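# The toy frame above has no label column, so the modelling steps below are
# guarded: they only run for a dataset that includes a 'target' column.
if 'target' in cleaned_df.columns:
    X = cleaned_df.drop(columns=['target'])
    y = cleaned_df['target']

    # Split first so that the test set contains only real, untouched rows
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Apply SMOTE to the training set only (the default k_neighbors=5
    # needs at least 6 rows in the minority class)
    smote = SMOTE(random_state=42)
    X_train, y_train = smote.fit_resample(X_train, y_train)

    # Fit the scaler on the training data, then apply it to both sets
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)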


Advanced data cleaning pipeline that includes:
- ✅ Handling missing values (numeric + categorical)
- ✅ Removing duplicates
- ✅ Fixing categorical inconsistencies
- ✅ Handling outliers (IQR method)
- ✅ Feature selection by removing low-variance features (VarianceThreshold)
- ✅ Handling imbalanced data (SMOTE)
- ✅ Feature scaling (StandardScaler)

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

def clean_data(df, target_column=None):
    """Automated Data Cleaning Pipeline with Feature Selection and Imbalance Handling"""

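    # 1️⃣ Remove Duplicate Rows (copy to avoid chained-assignment warnings later)
    df = df.drop_duplicates().copy()

    # 2️⃣ Identify Numerical and Categorical Columns
    # The target is left out of num_cols so its labels are never imputed or clipped as outliers
    num_cols = [c for c in df.select_dtypes(include=np.number).columns if c != target_column]
    cat_cols = df.select_dtypes(include="object").columns.tolist()
    
    # 3️⃣ Handle Missing Values
    num_imputer = SimpleImputer(strategy='median')
    df[num_cols] = num_imputer.fit_transform(df[num_cols])

    # Assign back instead of using inplace=True, which can silently act on a copy
    for col in cat_cols:
        df[col] = df[col].fillna(df[col].mode()[0])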

    # 4️⃣ Standardize Categorical Values (Lowercase, Remove Spaces)
    df[cat_cols] = df[cat_cols].apply(lambda x: x.str.lower().str.strip())

    # 5️⃣ Convert Data Types (Ensure numerical columns are numeric)
    for col in num_cols:
        df[col] = pd.to_numeric(df[col], errors='coerce')

    # 6️⃣ Handle Outliers (Using IQR Method)
    for col in num_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        df[col] = np.where(df[col] < lower_bound, lower_bound, df[col])
        df[col] = np.where(df[col] > upper_bound, upper_bound, df[col])

    # 7️⃣ Encode Categorical Variables
    for col in cat_cols:
        le = LabelEncoder()
        df[col] = le.fit_transform(df[col])

    # 8️⃣ Feature Selection (Removing Low Variance Features)
    selector = VarianceThreshold(threshold=0.01)
    df = pd.DataFrame(selector.fit_transform(df), columns=df.columns[selector.get_support()])

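    # 9️⃣ Handle Imbalanced Data (SMOTE) if Target Column is Provided
    if target_column and target_column in df.columns:
        X = df.drop(columns=[target_column])
        y = df[target_column]

        # SMOTE needs k_neighbors to be smaller than the smallest class,
        # so shrink it for tiny datasets such as the example below
        minority_count = int(y.value_counts().min())
        if minority_count > 1:
            smote = SMOTE(random_state=42, k_neighbors=min(5, minority_count - 1))
            X_resampled, y_resampled = smote.fit_resample(X, y)

            df = pd.DataFrame(X_resampled, columns=X.columns)
            df[target_column] = y_resampled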

    return df

# Example Usage:
df = pd.DataFrame({
    'Age': [25, np.nan, 30, 35, 200],  # Outlier in Age
    'Salary': [50000, np.nan, 60000, 65000, 70000],
    'City': ['New York', 'new york ', 'Los Angeles', np.nan, 'San Francisco'],
    'Gender': ['M', 'F', 'm', np.nan, 'F'],
    'target': [0, 1, 0, 1, 0]  # Binary target variable for classification
})

cleaned_df = clean_data(df, target_column='target')
print(cleaned_df)

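# Separate Features and Target
X = cleaned_df.drop(columns=['target'])
y = cleaned_df['target']

# Split into Train/Test Sets before scaling, so the scaler never sees test rows
# Note: clean_data has already applied SMOTE, so the test set may contain synthetic
# rows; for an unbiased evaluation, split before resampling.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale Numerical Features: fit on the training set, then apply to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)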


#### Data Splitting: 
- Split the data into training and testing sets. Typically, an 80-20 or 70-30 split is used.

In machine learning, datasets are typically divided into three parts: training set, validation set, and test set. Each serves a specific purpose in building and evaluating models.

**Training Set**

🔹 Purpose: Used to train the machine learning model.

🔹 What Happens Here?
- The model learns patterns, relationships, and structures from the data.
- It updates its parameters (e.g., weights in a neural network) to minimize error.

🔹 Size Considerations: Usually the largest portion of the dataset (e.g., 60-80%).

**Validation Set**

🔹 Purpose: Used to tune hyperparameters and prevent overfitting.

🔹 What Happens Here?
- Helps in model selection by comparing different algorithms or configurations.
- Used for hyperparameter tuning (e.g., learning rate, number of layers in a neural network).
- If the model performs well on the training set but poorly on the validation set, it might be overfitting.

🔹 Size Considerations: Typically 10-20% of the data.

📌 Important Note: In some cases, cross-validation (e.g., k-fold cross-validation) is used instead of a fixed validation set.
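
To illustrate the k-fold idea from the note above, the sketch below generates the index sets for each fold: every sample is used for validation exactly once and for training in the remaining folds. It is a simplified version without shuffling, and `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

The model is trained and scored once per fold, and the k validation scores are averaged to give a more stable estimate than a single fixed validation set.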

**Test Set**

🔹 Purpose: Used to evaluate the final model’s performance on unseen data.

🔹 What Happens Here?
- It simulates real-world deployment by testing the model on completely new data.
- The test set should not be used during training or tuning.

🔹 Size Considerations: Usually 10-20% of the dataset.


| Dataset| Used for  | Seen by Model During Training?    | Used for Model Selection?  |
|---------------|--------|----------------------------------| -----------------------|  
|Training Set| Learning patterns |	✅ Yes |  ❌ No   |
|Validation Set	|Hyperparameter tuning, model selection | ✅ Yes (but not for learning)     |  ✅ Yes    |
|Test Set | Final evaluation |		❌ No     | ❌ No   | 
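
A three-way split as summarised in the table can be sketched in plain Python as follows. The 60/20/20 proportions, the seed, and the helper name `three_way_split` are assumptions for illustration; with scikit-learn the same effect is commonly obtained by calling `train_test_split` twice.

```python
import random


def three_way_split(rows, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle the rows and cut them into train, validation and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, validation, test


train_rows, val_rows, test_rows = three_way_split(range(100))
print(len(train_rows), len(val_rows), len(test_rows))
```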

##### Key Concepts Related to Instances
Labeled vs. Unlabeled Instances
- Labeled Instance: Has both input features and the target variable (used in supervised learning).
- Unlabeled Instance: Has only input features but no target variable (used in unsupervised learning).

Training, Validation, and Test Instances
- Training Instances: Used to train the model.
- Validation Instances: Used to tune hyperparameters.
- Test Instances: Used to evaluate final model performance.

Instance vs. Feature
- An instance is a full data point (row in a table).
- A feature is an individual attribute describing an instance (column in a table).


## Exploratory Data Analysis

### Why is EDA important?

Exploratory Data Analysis (EDA) helps us to understand our data without making any assumptions. EDA is a vital component before we continue with the modelling phase as it provides context and guidance on the course of action to take when developing the appropriate model. It will also assist in interpreting the results correctly. Without doing EDA you will not understand your data fully.


### The different types of EDA

EDA is generally classified in two ways:

    1) Non-graphical or Graphical
    2) Univariate or Multivariate
    
<div align="left" style="width: 600px; text-align: left;">
<img src="https://github.com/Explore-AI/Pictures/blob/f860f39251c523eda779dea0140316ccbefdd8e0/eda_map.jpg?raw=True"
     alt="EDA Diagram"
     style="padding-bottom=0.5em"
     width=600px/>
</div>


#### Non-graphical EDA
Involves calculations of summary/descriptive statistics. 

#### Graphical EDA
This type of analysis will contain data visualisations.

#### Univariate Analysis 
This is performed on one variable at a time as the prefix 'uni' indicates. 

#### Multivariate Analysis 
This type of analysis explores the relationship between two or more variables. 
When only comparing two variables it is known as **bivariate analysis** as indicated by the prefix 'bi'.

Read a more detailed explanation <a href="https://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf">here</a>.

### 1. Basic Analysis

For a practical example, we will be looking at the Medical Claims Data. Using these four commands, we will perform a basic analysis:

    - df.head()
    - df.shape
    - df.info()
        - If a feature (variable) is categorical, its Dtype is object; if it is numerical, its Dtype is int64 or float64. 
        - This command also shows that, across the 1338 rows, none of the features contains any null values.
    - df.describe()

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/claims_data.csv')

# Looking at the top five rows of our data
df.head()

# shape command shows us that we have x rows of data and y features.
df.shape

#  confirms our categorical and numerical features.
df.info()

# Null values for each feature can also be checked by using the following command
df.isnull().sum()

# Population and Sample

**Population**
- Population is a collection of all data points of interest.
    - e.g., all employees of an organization form the population.
- **Parameter**
    - A number that describes a population.
        - e.g., we survey the whole organization and find that it has 20000 employees in total.
    
**Sample**
- Sample is a subset of the population.
    - e.g., the employees working on one particular project form a sample.
- **Statistic**
    - A number that describes a sample.
        - e.g., we count the employees working on that project and arrive at 20.

Which to choose: population or sample?

In real-life scenarios, we almost always work with sample data. 
- A sample is easier to collect and cheaper to compute on than the whole population. 
- Based on the results obtained from a sample, we can then use statistical inference to draw conclusions about the entire population.

# The Measure of Central tendency

The concept of central tendency is based on the below fact –
- “Given a large number of observations of a similar type, most of the observations seem to cluster around a central position when represented as a graph.”

# Univariate Analysis: Non-Graphical

The first univariate analysis will be non-graphical. This is where we will be looking at the **descriptive statistics** of each feature. 

## Continuous/Numeric Feature

##### **Descriptive Statistics**

We can get the descriptive statistics of each **numerical feature** by using the following command:

    - df.describe()

This command will provide the 
- mean, 
    - also known as the arithmetic mean: the statistical average of all data points in question.
- standard deviation, and
- the five number summary of each numerical feature, which is also used for creating the box plot:
    - Minimum, 
    - Lower Quartile (Q1) = 25%,
    - Median (Q2) = 50%, 
        - the middlemost data point in the dataset when arranged in ascending or descending order.
        - More resistant to outliers than the mean.
        - Median with an even number of data points = average of the middle two numbers.
        - Median with an odd number of data points = middlemost observation.
    - Upper Quartile (Q3) = 75%, 
    - Maximum.
- The box plot built from this summary exposes **outliers**: data points that are significantly different from the rest of the data points in consideration.
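
The two median rules above (odd versus even number of data points) and the median's resistance to outliers can be checked with a short sketch. The `median` helper is written out by hand for illustration, and the ages reuse the outlier value of 200 from the cleaning example earlier in these notes.

```python
def median(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                # odd count: middlemost observation
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average of middle two


ages = [25, 30, 35, 200]                          # 200 is an outlier
mean_age = sum(ages) / len(ages)
median_age = median(ages)
print(mean_age, median_age)
```

The single outlier pulls the mean far above most of the values, while the median stays close to the bulk of the data.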

Individual statistical measures can also be calculated by using the following commands:

    - df.count()
    - df.mean()
    - df.std()
    - df.min()
    - df.quantile([0.25, 0.5, 0.75], axis = 0)
    - df.median()
    - df.max()

The three measures of central tendency are the:
- mode
    - the value that appears most often in the dataset, 
- mean, and 
- median. 

The command to determine the mode is:

    - df.mode()

In [None]:
df.describe()

# statistics of a specific feature
df.age.describe()
df['age'].describe()

##### **Dispersion of Data**

Measures of dispersion describe how spread out the data are.
- They help us understand the variation in the data and give information about its distribution.

These include: 
- Range,
     - Measured by subtracting the lowest value from the highest value. 
          - A wide range indicates high variability.
          - A small range indicates low variability in the distribution.
     - Range = Highest_value  – Lowest_value
          - The range can be strongly influenced by outliers.
- Interquartile Range (IQR),
     - IQR is the distance between Q1 (the boundary between the first and second quarters of the sorted data) and Q3 (the boundary between the third and fourth quarters): IQR = Q3 – Q1.
     - IQR is preferred over the range because, unlike the range, it is not influenced by outliers. 
     - IQR measures variability by splitting a data set into four equal parts (quartiles).
          - A box plot uses the IQR to flag outliers.
               - Points outside the interval [Q1 – 1.5 * IQR, Q3 + 1.5 * IQR] are treated as outliers.
- Variance, 
     - Variance measures how far, on average, the values in the dataset lie from the mean (using squared distances).
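
The range and the IQR outlier rule above can be worked through on a small example. The sketch uses the standard library's `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points; the ages are made-up values and `iqr_outliers` is a hypothetical helper name.

```python
import statistics


def iqr_outliers(values):
    """Return the IQR fences and the values that fall outside them."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return (lower, upper), outliers


ages = [25, 30, 32.5, 35, 200]
value_range = max(ages) - min(ages)   # dominated by the single extreme value
fences, outliers = iqr_outliers(ages)
print(value_range, fences, outliers)
```

The range is inflated by the one extreme age, whereas the IQR fences are set by the middle half of the data and flag that value as an outlier.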

Population variance
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
Sample variance
$$ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$  

- Standard Deviation
     - Standard deviation is the square root of the variance, which brings the measure back to the original units of the data. 
     - A low standard deviation indicates that data points lie close to the mean.
     - For approximately normally distributed data:
         - about 68 % of values lie within 1 standard deviation of the mean,
         - about 95 % of values lie within 2 standard deviations,
         - about 99.7 % of values lie within 3 standard deviations.
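
The 68 / 95 / 99.7 percentages hold for a normal distribution and can be reproduced from its cumulative distribution function. The following is a small check using the standard library's `statistics.NormalDist`; it illustrates the rule for a standard normal and says nothing about data that are not approximately normal.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Probability mass within k standard deviations of the mean, for k = 1, 2, 3
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```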

Population standard deviation
$$\sigma = \sqrt{\frac{1}{N}\sum (x_i - \mu)^2}$$
Sample standard deviation
$$ s = \sqrt{\frac{1}{n - 1}\sum (x_i - \bar{x})^2}$$
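
As a worked example of the population and sample formulas, the sketch below computes both versions by hand on a small made-up dataset and cross-checks them against the standard library. Note that pandas' `df.std()` and `df.var()` use the sample version (dividing by n - 1) by default.

```python
import math
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical values
n = len(data)
mean = sum(data) / n
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / n        # divide by N
sample_variance = squared_deviations / (n - 1)      # divide by n - 1
population_std = math.sqrt(population_variance)
sample_std = math.sqrt(sample_variance)

# Cross-check against the standard library implementations
assert math.isclose(population_variance, statistics.pvariance(data))
assert math.isclose(sample_variance, statistics.variance(data))

print(population_variance, population_std)
print(round(sample_variance, 3), round(sample_std, 3))
```

The sample versions are slightly larger because dividing by n - 1 compensates for estimating the mean from the same data.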

##### Standard deviation and Mean Absolute deviation (Why SD is more reliable than MAD)


# Univariate Analysis: Graphical

Objective:
- Trends and Patterns of data
- Frequency
- Distribution of the variables
- Relationship that may exist between different variables

You can look at the **distribution** of any numerical feature by using the following plots:
- Scatter plot
- histogram
- density plot
- box plot
- violin plot
    
For a categorical feature we will use a:
- bar plot

## Continuous/Numerical variable

### Uni-variate summary plots :
These plots give a more concise description of the location, dispersion, and distribution of a variable than an enumerative plot. 
- A summary plot does not show every individual data value, but it represents the entire dataset efficiently.

#### Histogram and Density Plot

To display a histogram and density plot, we will use the Matplotlib library and create a list of all numerical features so that they can be visualised at the same time.

Both the histogram and the density plot display the same information. The density plot can be considered a smoothed version of the histogram; it does not depend on the bin size, although it does depend on the smoothing bandwidth.

In [None]:
features = ['age', 'bmi', 'steps', 'children', 'claim_amount'] # create a list of all numerical features
df[features].hist(figsize=(10,10))

In [None]:
df[features].plot(kind='density', subplots=True, layout=(3, 2), sharex=False, figsize=(10, 10));

#### Box Plot and Violin Plot

For the Box Plot and Violin Plot, we will use the seaborn library and only select one feature instead of all the numerical features. We can visualise all numerical features simultaneously, but as the range of values for each feature is different, it will not create a useful visualisation. Standardisation or normalisation can be applied to a feature to adjust the range, but we will not apply it in this notebook. Further reading on standardisation and normalisation can be done <a href="https://medium.com/@dataakkadian/standardization-vs-normalization-da7a3a308c64">here</a>.

The `bmi` feature will be used.

Although both the box plot and violin plot display the distribution of the data, the boxplot provides certain statistics that are useful. 

The five vertical lines in the boxplot show the five-number summary (minimum, first quartile, median, third quartile, and maximum, excluding outliers), and the dots on the right-hand side of the graph are outliers. The violin plot focuses more on a smoothed distribution.

In [None]:
sns.boxplot(x='bmi', data=df)

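# pd.melt reshapes a wide frame into long format with the default column names
# 'variable' and 'value'; `dfm` is assumed to hold only the numerical features.
sns.set(rc={'figure.figsize':(9,9)})
sns.boxplot(x = 'variable', y = 'value', data = pd.melt(dfm))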

In [None]:
sns.violinplot(x='bmi', data=df)


### Univariate enumerative Plots

#### Scatter plot

Plots different observations/values of the same variable corresponding to the index/observation number.
- plot the variable
- against the corresponding observation number stored as the index of the data frame (df.index)

In [None]:
plt.scatter(df.index, df['var1'])
plt.show()

In [None]:
sns.scatterplot(x = df.index, y = df['var'], hue = df['variety'])

# In seaborn, the ‘hue’ parameter, an interesting feature, determines which column in the data frame to use for color encoding.

#### Line plot
A line plot visualizes data by connecting the data points via line segments. 
- It resembles a scatter plot but differs by ordering the measurement points (usually by their x-axis value) and connecting them with straight line segments.

In [None]:
sns.set(rc = {'figure.figsize': (7,7)})
sns.set(font_scale= 1.5)

fig = sns.lineplot(x = df.index, y = df['var2'], markevery = 1, marker = 'd', data = df, hue = df['variety'])

#### Strip plot and Swarm Plot :
- The strip plot is similar to a scatter plot.
    - helps to plot the distribution of variables for each category as individual data points.
- The swarm-plot, similar to a strip-plot, provides a visualization technique for univariate data to view the spread of values in a continuous variable.
    - The only difference between the strip-plot and the swarm-plot is that the swarm-plot spreads out the data points of the variable automatically to avoid overlap and hence provides a better visual overview of the data.

In [None]:
sns.stripplot(y=df['var1'])
sns.stripplot(x = df['variety'], y = df['var1'])

In [None]:
sns.set(rc = {'figure.figsize': (5,5)})
sns.swarmplot(x = df['var'])
sns.swarmplot(x = df['variety'], y = df['var'])

### Categorical Data

#### Bar Plot

For the categorical features, we can create a **bar plot** to display the frequency distribution. 

A bar plot is drawn on two axes:
- one axis is the category axis, indicating the category, while the 
- second axis is the value axis, showing the numeric value of that category as the length of the bar.

We'll generate a bar plot of the `children` feature, where each bar represents a unique number of children from the data, and the height represents how many times that number of children occurred. This can be done by using seaborn's `countplot`. 

In [None]:
df['var'].value_counts().plot.bar()

In [None]:
sns.countplot(x = 'children', data = df, palette="hls")
plt.title("Distribution of Children")

##### Pie Chart:
Shows the numerical proportion occupied by each category
-  pass the array of values to the ‘labels’ parameter to add labels.

In [None]:
plt.pie(df['var'].value_counts(), labels= ['cat1', 'cat2', 'cat3'], shadow= True)

In [None]:
plt.pie(df['var'].value_counts(), startangle= 90, autopct='%.3f', labels= ['cat1', 'cat2', 'cat3'], shadow= True)

# Normal Distribution

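Quantities such as birth weight, IQ scores, and stock price returns often form a bell-shaped curve.

One reason the Normal Distribution is essential for data scientists is the Central Limit Theorem.
- The theorem states that the distribution of sample means approaches a normal distribution as the sample size grows, largely regardless of the shape of the underlying distribution, and it is the foundation for many hypothesis testing techniques.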

### Properties of Normal Distribution
- Bell-shaped curve
    - curve is symmetric around the Mean
    - Mean, Median, and Mode are all the same.
- Normal Distribution is symmetric, which means the tail on one side is the mirror image of the tail on the other side
- also called a Gaussian Distribution
- the Normal Distribution’s Probability Density is fully described by only two parameters
    - $\mu$ (mean)
    - $\sigma^2$ (variance)
- A normally distributed variable stays normal under linear transformations (shifting and scaling), a property many other probability distributions do not share.

For a Normal Distribution:
- The product of two Normal density functions is proportional to a Normal density function
- The sum of two independent normally distributed variables is normally distributed
- The convolution of two Normal density functions is also a Normal density function
- The Fourier transform of a Normal density function also has a Gaussian shape
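As a rough check of the sum property, the sketch below simulates two independent normal variables and compares the mean and variance of their sum with the values expected when means and variances add. The parameters, seed, and sample size are arbitrary choices for illustration, and the variables are assumed to be independent.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
n = 200_000

# Two independent normal variables with hypothetical parameters
x = rng.normal(loc=1.0, scale=2.0, size=n)
y = rng.normal(loc=3.0, scale=1.5, size=n)
s = x + y

# For independent normals, the means add and the variances add
expected_mean = 1.0 + 3.0
expected_var = 2.0**2 + 1.5**2

print(f"simulated mean {s.mean():.3f} vs expected {expected_mean:.3f}")
print(f"simulated variance {s.var():.3f} vs expected {expected_var:.3f}")
```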

Empirical Rule for Normal Distribution
- According to the Empirical Rule for Normal Distribution:
    - 68.27% of data lies within 1 standard deviation of the mean
    - 95.45% of data lies within 2 standard deviations of the mean
    - 99.73% of data lies within 3 standard deviations of the mean
-  almost all the data lies within 3 standard deviations. 

This rule enables us to check for Outliers and is very helpful when determining the normality of any distribution.
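The percentages in the empirical rule can be reproduced from the standard normal CDF. This is a minimal sketch using `scipy.stats.norm`; the dictionary name `coverage` is just an illustrative choice.

```python
from scipy.stats import norm

# Probability of falling within k standard deviations of the mean
coverage = {k: norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)}

for k, prob in coverage.items():
    print(f"within {k} standard deviation(s): {prob:.2%}")
```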

### Standard Normal Distribution
Standard Normal Distribution is a special case of Normal Distribution when
- $\mu$ = 0
- $\sigma$ = 1

Convert Normal Distribution into Standard Normal distribution with
$$ Z = \frac{X - \mu}{\sigma}$$

Example: comparing one student's Maths mark with another student's History mark
- whoever gets the higher z-score performed better relative to their class.
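A small sketch of this comparison is shown below. The marks, class means, and standard deviations are invented for illustration, and `z_score` is a hypothetical helper name.

```python
def z_score(x, mu, sigma):
    """Standardise a value: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

# Hypothetical marks and class statistics (made up for illustration)
maths_z = z_score(x=70, mu=60, sigma=10)    # student A, Maths
history_z = z_score(x=80, mu=75, sigma=10)  # student B, History

print(f"Maths z-score: {maths_z:.2f}")
print(f"History z-score: {history_z:.2f}")
```

Here the History mark is higher in raw terms, but the Maths mark has the higher z-score, so the Maths result is the stronger one relative to its class.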

### Skewed Distribution

When data points cluster more on one side than the other, the distribution is called a Skewed Distribution.

##### **kurtosis** and **skew**. 

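Both kurtosis and skew are important statistical terms to be familiar with in data science. Kurtosis measures how heavy the tails of a distribution are, which relates to how prone the data is to outliers. **High kurtosis (>3)** indicates heavy tails and many outliers, and **low kurtosis (<3)** indicates light tails and few outliers; a normal distribution has a kurtosis of 3. Note that pandas' `df.kurtosis()` reports excess kurtosis (kurtosis minus 3), so the corresponding threshold there is 0. Skew indicates how symmetrical your data is. Below is a table that explains the range of values with regards to skew.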

Left skewed distribution
- Mode > Median > Mean.

Right Skewed Distribution
- Mode < Median < Mean
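The ordering of mean and median under skew can be checked on simulated data. The sketch below assumes a log-normal sample as a stand-in for right-skewed data and simply negates it to get a left-skewed sample; the seed and sample size are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice

# Hypothetical right-skewed sample (log-normal) and its mirror image
right = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000))
left = -right

for name, s in [("right-skewed", right), ("left-skewed", left)]:
    print(f"{name}: mean={s.mean():.3f}, median={s.median():.3f}, skew={s.skew():.3f}")
```

The long tail pulls the mean further than the median, so the mean sits above the median for the right-skewed sample and below it for the left-skewed one.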


|   Skew Value (x)  |   Description of Data   |
|:-------------------|:---------------:|
| -0.5 < x < 0.5 | Fairly Symmetrical |
| -1 < x < -0.5 | Moderate Negative Skew |
| 0.5 < x < 1 | Moderate Positive Skew |
| x < -1 | High Negative Skew |
| x > 1 | High Positive Skew |

<div align="left" style="width: 500px; font-size: 80%; text-align: left; margin: 0 auto">
<img src="https://github.com/Explore-AI/Pictures/blob/f3aeedd2c056ddd233301c7186063618c1041140/regression_analysis_notebook/skew.jpg?raw=True"
     alt="Dummy image 1"
     style="float: left; padding-bottom=0.5em"
     width=500px/>
     For a more detailed explanation on skew and kurtosis read <a href="https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa">here</a>.
</div>


The commands used to determine the skewness of data are:

    - df.skew()

### Check the **Normality** of a Distribution
- Histogram
- KDE Plots
- Q-Q Plots
- Skewness
- Kurtosis


In [None]:
df.skew()

# Between -0.5 and 0.5 implies fairly symmetrical.
# Between 0.5 and 1 (in absolute value) implies moderately skewed.
# Above 1 (in absolute value) implies highly skewed.

### Kurtosis

Another check for Normality is Kurtosis. 

Kurtosis describes the tailedness of a distribution, that is, how much of the data lies in the tails. The values below refer to excess kurtosis (kurtosis minus 3), which is what `df.kurtosis()` returns.
- For a distribution with tails similar to the Gaussian Distribution, the Kurtosis value will be close to zero. Such distributions are called Mesokurtic.

- If extreme values are present in the data, more data points lie in the tails, and the Kurtosis value will be greater than zero.
    - The tails are fatter and longer than those of the Gaussian Distribution. Such distributions are called Leptokurtic.

- If there are fewer extreme values than in a Normal Distribution, fewer data points lie in the tails.
    - The Kurtosis value will be less than zero. Such distributions are called Platykurtic. 
        - They have thinner and shorter tails than the Normal distribution.
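To make the three cases concrete, the sketch below looks up the theoretical excess kurtosis (normal = 0) of three reference distributions in `scipy.stats`. The choice of Laplace and uniform as the heavy-tailed and light-tailed examples is an assumption made for illustration.

```python
from scipy import stats

# Theoretical excess kurtosis for three reference distributions
excess = {
    "normal (mesokurtic)": float(stats.norm.stats(moments="k")),
    "laplace (leptokurtic)": float(stats.laplace.stats(moments="k")),
    "uniform (platykurtic)": float(stats.uniform.stats(moments="k")),
}

for name, value in excess.items():
    print(f"{name}: excess kurtosis = {value:.2f}")
```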

The commands used to determine the kurtosis of data are:

    - df.kurtosis()

In [None]:
# Indicates a lack of outliers for all features.
df.kurtosis()

### Transform features into Normal/Gaussian Distribution
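- Models such as Linear Regression, Logistic Regression, and Artificial Neural Networks often train and perform better when numerical features are roughly normally distributed, although strictly speaking linear regression assumes normally distributed residuals rather than normally distributed features.
- Transforming heavily skewed features towards a normal shape is therefore a common preprocessing step.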

**What do we do when data provided to us does not necessarily follow a normal distribution?**

### Gaussian Distribution

In probability theory, a normal (or Gaussian) distribution is a type of continuous probability distribution for a real-valued random variable.
- general form of its probability density function is
$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$$

Samples from a Gaussian Distribution follow a bell-shaped curve centred on the mean. 
- The mean, median, and mode of Gaussian Distribution are the same.

Steps:
1. Check whether the variable follows a Normal Distribution (see above)
- Check the distribution of the variable using a Q-Q plot
    - A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles come from the same distribution, the points form a roughly straight line.
        - If the data falls on a straight line, the variable follows a normal distribution; otherwise it does not.

Example: if a variable is highly positively skewed
- plot the Q-Q plot for the variable and check.

If the data points of the feature do not fall on a straight line, the feature does not follow a normal distribution.

In [None]:
#importing necessary libraries
import scipy.stats as stats
import pylab

# `cp` is the DataFrame that holds the `price` feature
stats.probplot(cp.price,plot=pylab)
pylab.show()

##### Function in python which will take data and feature name as inputs and return the KDE plot and Q-Q plot of the feature.

In [None]:
# function to return plots for the feature
def normality(data,feature):
    plt.figure(figsize=(10,5))
    plt.subplot(1,2,1)
    sns.kdeplot(data[feature])
    plt.subplot(1,2,2)
    stats.probplot(data[feature],plot=pylab)
    plt.show()

### Performing the transformations

##### **Logarithmic Transformation**
Convert each value to its natural logarithm, i.e. log(Price). This requires strictly positive values.

In [None]:
# performing logarithmic transformation on the feature
cp['price_log']=np.log(cp['price'])
# plotting to check the transformation
normality(cp,'price_log')

##### **Reciprocal Transformation**
This takes the reciprocal of each Price value, i.e. 1/Price.

In [None]:
cp['price_reciprocal']=1/cp.price
normality(cp,'price_reciprocal')

##### **Square Root Transformation**

This transformation will take the square root of the Price column i.e sqrt(Price).

In [None]:
cp['price_sqroot']=np.sqrt(cp.price)
normality(cp,'price_sqroot')

##### **Exponential Transformation**

Despite the name, the code below applies a power transformation: each Price value is raised to the power 1/1.2.

In [None]:
cp['price_exponential']=cp.price**(1/1.2)
normality(cp,'price_exponential')

##### **Box-Cox Transformation**

$$ y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0, \\ \ln(y_i) & \text{if } \lambda = 0 \end{cases}$$

where:
- y is the response variable (it must be strictly positive) and 
- λ is the transformation parameter. 
    - λ is typically searched over a range such as -5 to 5. 

During the transformation, candidate values of λ are considered and the value that makes the transformed data most closely normal is selected. 
- log(y) is only applied when λ=0.

Because Box-Cox chooses λ from the data instead of fixing the transformation in advance, it often gives a better result than the fixed transformations above.
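The piecewise Box-Cox formula can be written out by hand and compared with `scipy.stats.boxcox` for a fixed λ. This is only a sketch: `boxcox_manual` is a hypothetical helper, and the input values are invented (they must be strictly positive).

```python
import numpy as np
from scipy import stats

def boxcox_manual(y, lam):
    """Apply the Box-Cox formula for a fixed lambda (y must be positive)."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1) / lam

y = np.array([1.0, 2.0, 5.0, 10.0, 50.0])  # hypothetical positive values

# With lmbda given, scipy applies the same formula and returns only the data
for lam in (0, 0.5, -1):
    print(lam, np.allclose(boxcox_manual(y, lam), stats.boxcox(y, lmbda=lam)))

# Without lmbda, scipy also estimates lambda by maximum likelihood
transformed, best_lam = stats.boxcox(y)
print("estimated lambda:", best_lam)
```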

In [None]:
cp['price_Boxcox'],parameters=stats.boxcox(cp['price'])
normality(cp,'price_Boxcox')

# Types Of Probability Distribution Function in Univariate Analysis

A probability distribution function is a mathematical way of showing how likely different outcomes are in a random event. 
- It assigns a probability to each possible result, and 
- the probabilities of all possible results add up to 1. 

It helps us understand the chances of different outcomes in a random experiment.

### Distribution Function
- is a mathematical expression that describes the probability of different possible outcomes for an experiment.
- denoted as Variable ~ Type (Characteristics)

Data Types
- Data is either Qualitative or Quantitative. 
    - Quantitative data is divided into 
        - Continuous data types/random variables. 
            - Continuous data is measured and can take any value within a given finite or infinite range.
            - Continuous data is usually represented in decimal format.
            - Example: 
                - person’s height, 
                - time, 
                - distance
        - Discrete data types.
            - Discrete data is counted and can take only a countable number of distinct values.
            - Discrete data is represented as whole numbers.
            - Example:
                - number of students in a class, 
                - number of workers in a company

### Types of distribution functions

|   Discrete distributions   |      Continuous distributions     |
|:-------------------|:---------------:|
|  Uniform distribution | Normal distribution |
| Binomial distribution | Standard Normal distribution  | 
| Bernoulli distribution  | Student’s T distribution  | 
| Poisson distribution  | Chi-squared  distribution  |

#### **Probability Density Function (PDF):**
- Statistical term that describes the probability distribution of a **continuous** random variable.
- The probability associated with any single value is always zero; probabilities are obtained for intervals, as the area under the density curve.

$$P(a \leq X \leq b) = \int^{b}_{a} f(x)\,dx, \quad f(x) \geq 0$$

#### **Probability Mass Function (PMF):**
- Statistical term that describes the probability distribution of a **discrete** random variable.

$$p(x) = P(X=x)$$

Where:
- p(x) is the probability that the random variable X takes the specific value x

#### **Cumulative Distribution Function (CDF):**
- It is another method to describe the distribution of a random variable (either continuous or discrete).

$$ F_X (x) = P(X \leq x)$$

Where:
- $F_X(x)$ = the cumulative distribution function of X evaluated at x
- X = a real-valued random variable
- P = the probability that X takes a value less than or equal to x
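The link between the three functions can be sketched numerically: for a continuous variable the interval probability is the area under the PDF, which equals a difference of CDF values, and for a discrete variable the CDF is the running sum of the PMF. The standard normal and the Binomial(10, 0.3) used here are arbitrary example distributions.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Continuous case: P(a <= X <= b) is the area under the PDF, i.e. F(b) - F(a)
a, b = -1.0, 2.0
area, _ = quad(stats.norm.pdf, a, b)
cdf_diff = stats.norm.cdf(b) - stats.norm.cdf(a)
print(f"integral of PDF: {area:.6f}, F(b) - F(a): {cdf_diff:.6f}")

# Discrete case: the CDF is the cumulative sum of the PMF
k = np.arange(0, 11)
pmf = stats.binom.pmf(k, 10, 0.3)
cdf_from_pmf = np.cumsum(pmf)
print("PMF sums to:", pmf.sum())
print("CDF matches cumulative PMF:", np.allclose(cdf_from_pmf, stats.binom.cdf(k, 10, 0.3)))
```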

### Discrete Distribution

##### **1. Discrete Uniform distribution**
- Denoted as X ~ U (a, b)
- where X is a discrete random variable that follows uniform distribution ranging from a to b.
- Uniform distribution is when all the possible events are equally likely.
- Example:
    - Experiment of rolling a fair die
    - six possible events X = {1, 2, 3, 4, 5, 6}, each having a probability of P(X = x) = 1/6.

Formulas for the discrete uniform distribution, where $n = b - a + 1$ is the number of possible values:

|   Term   |     Formula     |
|:-------------------|:---------------:|
| Support | $k \in \{a, a + 1, ..., b-1, b\}$ |
| PMF | $\frac{1}{n}$ |
| CDF | $\frac{\lfloor k \rfloor - a + 1}{n}$ |
| Mean | $\frac{(a + b)}{2}$ |
| Variance | $\frac{(n^2 - 1)}{12}$ |

Case Study: Lottery Number Simulation

A lottery system allows participants to pick a number between 1 and 6, inclusive, where each number has an equal chance of being selected. 
- This setup represents a discrete uniform distribution.

PMF:
- Since each outcome is equally likely, the probability for each number from 1 to 6 will be $\frac{1}{6} ≈ 0.1667$.

CDF:
- The cumulative probabilities for the outcomes [1, 2, 3, 4, 5, 6] increase in steps of 1/6: [0.1667, 0.3333, 0.5, 0.6667, 0.8333, 1.0]

Mean:
- For a discrete uniform distribution:

$$ Mean = \frac{Low + High}{2}$$
$$ = \frac{1 + 6}{2}$$
$$ = 3.5 $$

Variance:
- For a discrete uniform distribution:

$$ Variance = \frac{(High - Low + 1)^2 - 1}{12}$$
$$ = \frac{(6 - 1 + 1)^2 - 1}{12}$$
$$ = \frac{35}{12} $$
$$ \approx 2.92 $$
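The worked values above can be confirmed exactly from the definitions of mean and variance, without simulation. This sketch uses Python's `fractions` module so that the results are exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

low, high = 1, 6
outcomes = range(low, high + 1)
n = high - low + 1
p = Fraction(1, n)  # every outcome is equally likely

# Mean and variance straight from their definitions
mean = sum(k * p for k in outcomes)
variance = sum((k - mean) ** 2 * p for k in outcomes)

# CDF at each outcome: (k - low + 1) / n
cdf = [p * (k - low + 1) for k in outcomes]

print("mean:", mean, "variance:", variance)
print("CDF:", [str(value) for value in cdf])
```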


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import randint

# 1. Define the parameters of the discrete uniform distribution
low, high = 1, 6  # Numbers range from 1 to 6, inclusive

# 2. Simulate the discrete uniform distribution
n_samples = 10000
samples = np.random.randint(low, high + 1, size=n_samples)

# 3. Calculate the PMF
pmf = [1 / (high - low + 1)] * (high - low + 1)  # Since it's uniform, all probabilities are equal
outcomes = np.arange(low, high + 1)

# 4. Calculate the CDF
cdf = np.cumsum(pmf)

# 5. Mean and Variance
mean = np.mean(samples)
variance = np.var(samples)

# 6. Visualization
plt.figure(figsize=(12, 6))

# PMF Plot
plt.subplot(1, 2, 1)
plt.bar(outcomes, pmf, color='skyblue', alpha=0.7)
plt.title("PMF of Discrete Uniform Distribution")
plt.xlabel("Outcomes")
plt.ylabel("Probability")
plt.xticks(outcomes)

# CDF Plot
plt.subplot(1, 2, 2)
plt.step(outcomes, cdf, where='post', color='orange', label="CDF")
plt.title("CDF of Discrete Uniform Distribution")
plt.xlabel("Outcomes")
plt.ylabel("Cumulative Probability")
plt.xticks(outcomes)
plt.legend()

plt.tight_layout()
plt.show()

# 7. Print results
print("PMF:", pmf)
print("CDF:", cdf)
print(f"Simulated Mean: {mean:.2f}")
print(f"Simulated Variance: {variance:.2f}")


##### **2. Binomial distribution**
- Denoted as X ~ B(n, p).
- where X is a discrete random variable that follows Binomial distribution with parameters n, p.
    - n is the no. of trials,
    - p is the success probability for each trial.
- It is the probability distribution of the number of successes in a sequence of ‘n’ independent experiments.
    - A Binomial event describes the number of times a specific outcome can be expected.
- The two outcomes of each Binomial trial are 
    - Success, denoted as 1, with associated probability p.
    - Failure, denoted as 0, with associated probability q = 1-p.
- Example: 
    - Success/Failure, 
    - Pass/Fail, 
    - Win/Lose


|   Term   |     Formula     |
|:-------------------|:---------------:|
| PMF | $\binom{n}{k} p^k q^{n - k}$ |
| CDF | $I_q ( n - k, 1 + k)$, the regularized incomplete beta function |
| Mean | $n \times p$ |
| Variance | $n \times p \times q$ |

Case Study: Quality Control in Manufacturing
- A manufacturing plant produces light bulbs. We inspect a batch of 10 bulbs. Each light bulb has a: 
    - 90% probability of passing quality control (success) and a 
    - 10% probability of failing (failure). 

PMF: 
- The PMF provides the probability of having exactly k successes in n trials:
- For example, P(X=9) represents the probability that 9 out of 10 light bulbs pass quality control.

$$ P(X = k) = \binom{n}{k} p^k q^{n - k}$$

CDF: 
- The CDF provides the cumulative probability of having up to k successes:
$$ P(X \leq k) = \sum_{i=0}^{k} P(X = i) $$


Mean: For a Binomial distribution:
$$ Mean = n \times p$$
$$ = 10 \times 0.9 $$
$$ = 9 $$

Variance: For a Binomial distribution:

$$ Variance = n \times p \times (1 - p) $$
$$ = 10 \times 0.9 \times 0.1 $$
$$ = 0.9 $$
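For the light bulb case study, the PMF and CDF formulas can be evaluated by hand with the standard library. This is a minimal sketch; `binom_pmf` is a hypothetical helper, and `math.comb` needs Python 3.8 or later.

```python
from math import comb

n, p = 10, 0.9   # batch size and pass probability from the case study
q = 1 - p

def binom_pmf(k):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * q**(n - k)

p_exactly_9 = binom_pmf(9)                              # P(X = 9)
p_at_most_8 = sum(binom_pmf(i) for i in range(0, 9))    # P(X <= 8)
mean = n * p
variance = n * p * q

print(f"P(X = 9)  = {p_exactly_9:.4f}")
print(f"P(X <= 8) = {p_at_most_8:.4f}")
print(f"mean = {mean:.2f}, variance = {variance:.2f}")
```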

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# 1. Define the parameters of the Binomial distribution
n = 10  # Number of trials (light bulbs in the batch)
p = 0.9  # Probability of success (passing quality control)

# 2. Simulate the Binomial distribution
n_samples = 10000
samples = np.random.binomial(n, p, size=n_samples)

# 3. Calculate the PMF
x = np.arange(0, n + 1)  # Possible outcomes: 0 to n successes
pmf = binom.pmf(x, n, p)

# 4. Calculate the CDF
cdf = binom.cdf(x, n, p)

# 5. Mean and Variance
mean = n * p  # Mean of a Binomial distribution
variance = n * p * (1 - p)  # Variance of a Binomial distribution

# 6. Visualization
plt.figure(figsize=(12, 6))

# PMF Plot
plt.subplot(1, 2, 1)
plt.bar(x, pmf, color='skyblue', alpha=0.7, label='PMF')
plt.title("PMF of Binomial Distribution")
plt.xlabel("Number of Successes")
plt.ylabel("Probability")
plt.legend()

# CDF Plot
plt.subplot(1, 2, 2)
plt.step(x, cdf, where='post', color='orange', label='CDF')
plt.title("CDF of Binomial Distribution")
plt.xlabel("Number of Successes")
plt.ylabel("Cumulative Probability")
plt.legend()

plt.tight_layout()
plt.show()

# 7. Print results
print("PMF:", pmf)
print("CDF:", cdf)
print(f"Theoretical Mean: {mean:.2f}")
print(f"Theoretical Variance: {variance:.2f}")
print(f"Simulated Mean: {np.mean(samples):.2f}")
print(f"Simulated Variance: {np.var(samples):.2f}")


##### **3. Bernoulli distribution**
- denoted as X ~ Bern(p).
- Where X is a discrete random variable that follows Bernoulli distribution with parameter p.
    - Where p is the probability of the success.
- Bernoulli is a Binomial experiment with a single trial.
    - Bernoulli’s event suggests which outcome can be expected for a single trial.
- Example: tossing a fair coin. The two possible outcomes are 
    - Heads and Tails. 
    - The probability (p) associated with each of them is 1/2.
- Example: with an unfair coin
    - if Heads has a probability of p = 0.8, then the probability of Tails is q = 1-p = 1-0.8 = 0.2

|   Term   |     Formula     |
|:-------------------|:---------------:|
| PMF | $\begin{cases} q = 1 - p & \text{if } k = 0 \\ p & \text{if } k = 1 \end{cases}$, equivalently $p^k (1 - p)^{1 - k}$ |
| CDF | $\begin{cases} 0 & \text{if } k < 0 \\ 1 - p & \text{if } 0 \leq k < 1 \\ 1 & \text{if } k \geq 1 \end{cases}$ |
| Mean | $p$ |
| Variance | $p(1 - p) = p \times q$ |

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import bernoulli

# 1. Define the parameters of the Bernoulli distribution
p = 0.5  # Probability of success (Heads)

# 2. Simulate the Bernoulli distribution
n_samples = 10000
samples = np.random.binomial(1, p, size=n_samples)  # Equivalent to Bernoulli

# 3. Calculate the PMF
x = [0, 1]  # Possible outcomes: 0 (Tails), 1 (Heads)
pmf = bernoulli.pmf(x, p)

# 4. Calculate the CDF
cdf = bernoulli.cdf(x, p)

# 5. Mean and Variance
mean = p  # Mean of a Bernoulli distribution
variance = p * (1 - p)  # Variance of a Bernoulli distribution

# 6. Visualization
plt.figure(figsize=(12, 6))

# PMF Plot
plt.subplot(1, 2, 1)
plt.bar(x, pmf, color='skyblue', alpha=0.7, label='PMF')
plt.title("PMF of Bernoulli Distribution")
plt.xlabel("Outcomes (0: Tails, 1: Heads)")
plt.ylabel("Probability")
plt.xticks(x)
plt.legend()

# CDF Plot
plt.subplot(1, 2, 2)
plt.step(x, cdf, where='post', color='orange', label='CDF')
plt.title("CDF of Bernoulli Distribution")
plt.xlabel("Outcomes (0: Tails, 1: Heads)")
plt.ylabel("Cumulative Probability")
plt.xticks(x)
plt.legend()

plt.tight_layout()
plt.show()

# 7. Print results
print("PMF:", pmf)
print("CDF:", cdf)
print(f"Theoretical Mean: {mean:.2f}")
print(f"Theoretical Variance: {variance:.2f}")
print(f"Simulated Mean: {np.mean(samples):.2f}")
print(f"Simulated Variance: {np.var(samples):.2f}")


##### **4. Poisson Distribution**
- Denoted as X ~ Po(λ). 
- Where X is a discrete random variable that follows Poisson Distribution with parameter λ.
    - Where λ is the expected rate of occurrences.
- It expresses the probability of a given number of events occurring in a fixed time interval.
- Examples: 
    - The number of diners at a restaurant on a given day.
    - Calls per hour at a call centre.

|   Term   |     Formula     |
|:-------------------|:---------------:|
| PMF | $\frac{\lambda^k e^{-\lambda}}{k!}$ |
| CDF | $e^{-\lambda} \sum^{\lfloor k \rfloor}_{i = 0} \frac{\lambda^i}{i!}$ |
| Mean | $\lambda$ |
| Variance | $\lambda$ |

Case Study: Website Traffic
- A website receives an average of λ=3 inquiries per minute. 
- The number of inquiries in any given minute can be modeled using a Poisson distribution.

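PMF:
- for λ=3 and k=2:

$$ P(X = k ) = \frac{\lambda^k e^{-\lambda}}{k!}$$
$$ P(X = 2 ) = \frac{3^2 e^{-3}}{2!}$$
$$ \approx 0.224$$

CDF: 
- for λ=3 and k=2:

$$ P(X \leq 2) = e^{-3} \left(\frac{3^0}{0!} + \frac{3^1}{1!} + \frac{3^2}{2!}\right) \approx 0.423$$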

Mean:
- Mean=λ=3

Variance:
- Variance=λ=3
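The Poisson PMF and CDF for this case study can be evaluated directly from the formulas using only the standard library. `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial

lam = 3  # average inquiries per minute from the case study

def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return lam**k * exp(-lam) / factorial(k)

p_2 = poisson_pmf(2, lam)                             # P(X = 2)
p_le_2 = sum(poisson_pmf(i, lam) for i in range(3))   # P(X <= 2)

print(f"P(X = 2)  = {p_2:.4f}")
print(f"P(X <= 2) = {p_le_2:.4f}")
```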

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

# 1. Define the parameter of the Poisson distribution
lam = 3  # Average rate (inquiries per minute)

# 2. Simulate the Poisson distribution
n_samples = 10000
samples = np.random.poisson(lam, size=n_samples)

# 3. Calculate the PMF
x = np.arange(0, 15)  # Possible outcomes (0 to 14 inquiries)
pmf = poisson.pmf(x, lam)

# 4. Calculate the CDF
cdf = poisson.cdf(x, lam)

# 5. Mean and Variance
mean = lam  # Mean of a Poisson distribution
variance = lam  # Variance of a Poisson distribution

# 6. Visualization
plt.figure(figsize=(12, 6))

# PMF Plot
plt.subplot(1, 2, 1)
plt.bar(x, pmf, color='skyblue', alpha=0.7, label='PMF')
plt.title("PMF of Poisson Distribution")
plt.xlabel("Number of Inquiries")
plt.ylabel("Probability")
plt.legend()

# CDF Plot
plt.subplot(1, 2, 2)
plt.step(x, cdf, where='post', color='orange', label='CDF')
plt.title("CDF of Poisson Distribution")
plt.xlabel("Number of Inquiries")
plt.ylabel("Cumulative Probability")
plt.legend()

plt.tight_layout()
plt.show()

# 7. Print results
print("PMF:", pmf)
print("CDF:", cdf)
print(f"Theoretical Mean: {mean:.2f}")
print(f"Theoretical Variance: {variance:.2f}")
print(f"Simulated Mean: {np.mean(samples):.2f}")
print(f"Simulated Variance: {np.var(samples):.2f}")


### Continuous Distributions

##### **1. Normal or Gaussian Distribution**
- denoted as $X \sim N(\mu, \sigma^2)$. 
- Where X is a continuous random variable that follows a Normal distribution with parameters $\mu$, $\sigma^2$.
    - $\mu$ is the mean. 
    - $\sigma^2$ is the variance.
- describes the probability of a continuous random variable that takes real values.
- Examples:
    - Heights of people, 
    - exam scores of students, 
    - IQ Scores,
- Normal distribution follows the 68-95-99.7 rule (empirical rule). 
    - about 68% of data lies within one standard deviation of the mean, 
    - about 95% of data lies within two standard deviations, and 
    - about 99.7% of data lies within three standard deviations.

Properties of Normal distribution:
- The random variable takes values from -∞ to +∞
- The probability associated with any single value is zero.
- The density looks like a bell curve and is symmetric about x=μ. 
    - 50% of data lies on the left-hand side and 
    - 50% of the data lies on the right-hand side.
- The area under the curve (AUC) = 1
- All the measures of central tendency coincide i.e., mean = median = mode

|   Term   |     Formula     |
|:-------------------|:---------------:|
| PDF | $\frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$ |
| CDF | $\frac{1}{2} \left[ 1 + \operatorname{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right) \right]$ |
| Mean | $\mu$ |
| Variance | $\sigma^2$ |

Case Study: Human Heights
- Assume the heights of adults in a population are normally distributed with:
    - μ=170 cm (average height).
    - σ=10 cm (standard deviation).

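PDF:
- For μ=170, σ=10, and x=180:

$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$
$$ f(180) = \frac{1}{10 \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{180 - 170}{10}\right)^2} \approx 0.0242$$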

CDF:
- For x = 180, P(X≤180)≈0.841.

Mean:
- Mean=μ=170,

Variance: 
- Variance= $σ^2$=100
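The height example can be worked through with the closed-form PDF and the error-function form of the CDF, using only the standard library. `normal_pdf` and `normal_cdf` are hypothetical helper names.

```python
from math import erf, exp, pi, sqrt

mu, sigma, x = 170, 10, 180  # values from the case study

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for N(mu, sigma^2), written with the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

density_180 = normal_pdf(x, mu, sigma)
prob_le_180 = normal_cdf(x, mu, sigma)

print(f"f(180)      = {density_180:.4f}")
print(f"P(X <= 180) = {prob_le_180:.4f}")
```

The density value is not itself a probability; only the CDF value is the probability that a height is at most 180 cm.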

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# 1. Define the parameters of the Normal distribution
mu = 170  # Mean (average height in cm)
sigma = 10  # Standard deviation (spread of height in cm)

# 2. Simulate the Normal distribution
n_samples = 10000
samples = np.random.normal(mu, sigma, size=n_samples)

# 3. Calculate the PDF
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 1000)  # Range of values (±4σ)
pdf = norm.pdf(x, mu, sigma)

# 4. Calculate the CDF
cdf = norm.cdf(x, mu, sigma)

# 5. Mean and Variance
mean = mu  # Mean of a Normal distribution
variance = sigma ** 2  # Variance of a Normal distribution

# 6. Visualization
plt.figure(figsize=(12, 6))

# PDF Plot
plt.subplot(1, 2, 1)
plt.plot(x, pdf, color='skyblue', label='PDF')
plt.title("PDF of Normal Distribution")
plt.xlabel("Height (cm)")
plt.ylabel("Density")
plt.legend()

# CDF Plot
plt.subplot(1, 2, 2)
plt.plot(x, cdf, color='orange', label='CDF')
plt.title("CDF of Normal Distribution")
plt.xlabel("Height (cm)")
plt.ylabel("Cumulative Probability")
plt.legend()

plt.tight_layout()
plt.show()

# 7. Print results
print(f"Theoretical Mean: {mean:.2f}")
print(f"Theoretical Variance: {variance:.2f}")
print(f"Simulated Mean: {np.mean(samples):.2f}")
print(f"Simulated Variance: {np.var(samples):.2f}")


# Multivariate Analysis: Non-Graphical 

### Continuous - Continuous

##### **Covariance**

Covariance is a statistical tool that quantifies how two random variables vary together around their expected values (means).
- It is a measure of the linear relationship between two random variables. 
- It can take any positive or negative value.
    - Positive Covariance: 
        - It indicates that two variables tend to move in the same direction, which means that if we increase the value of one variable other variable value will also increase.
    - Zero Covariance: 
        - It indicates that there is no linear relationship between them.
    - Negative Covariance: 
        - It indicates that two variables tend to move in the opposite direction, which means that if we increase the value of one variable other variable value will decrease and vice versa.

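Formula:

Population covariance
$$ Cov(X, Y) = \frac{\sum (x_i - \mu_X)(y_i - \mu_Y)}{N}$$
Sample covariance
$$ Cov(X, Y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$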

Limitations of Covariance
- The magnitude of the covariance does not indicate the strength of the relationship; only its sign (positive or negative) is directly interpretable.
- Covariance depends on the units of measurement: if X and Y are rescaled, then in general Cov(X’, Y’) ≠ Cov(X, Y), even though the underlying relationship has not changed.
- Covariance does not capture non-linear relationships between two variables.
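The scale dependence of covariance can be illustrated with a quick simulation. The sketch below assumes made-up height and weight data; converting heights from metres to centimetres multiplies the covariance by 100, while the Pearson correlation stays the same.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice

# Hypothetical data: heights in metres and loosely related weights in kg
x_m = rng.normal(1.7, 0.1, size=500)
y_kg = 60 * x_m + rng.normal(0, 5, size=500)
x_cm = x_m * 100  # same heights, different unit

# Sample covariance written out from its definition (divide by n - 1)
n = len(x_m)
cov_manual = np.sum((x_m - x_m.mean()) * (y_kg - y_kg.mean())) / (n - 1)

cov_m = np.cov(x_m, y_kg)[0, 1]
cov_cm = np.cov(x_cm, y_kg)[0, 1]
corr_m = np.corrcoef(x_m, y_kg)[0, 1]
corr_cm = np.corrcoef(x_cm, y_kg)[0, 1]

print(f"covariance (m):  {cov_m:.4f}   covariance (cm):  {cov_cm:.4f}")
print(f"correlation (m): {corr_m:.4f}   correlation (cm): {corr_cm:.4f}")
```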

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr, kendalltau, pointbiserialr

# Step 1: Simulate Two Random Variables
np.random.seed(42)  # For reproducibility
n = 100  # Number of samples

# Variable X: Continuous Random Variable
X = np.random.normal(loc=50, scale=10, size=n)  # Mean=50, Std Dev=10

# Variable Y: Continuous Random Variable
Y = 0.5 * X + np.random.normal(loc=0, scale=5, size=n)  # Linear relationship with noise

# Convert to DataFrame for easier handling
data = pd.DataFrame({'X': X, 'Y': Y})

# Step 2: Covariance Calculation
# Covariance measures how two variables vary together.
cov_matrix = np.cov(X, Y)
covariance = cov_matrix[0, 1]

print(f"Covariance between X and Y: {covariance:.4f}")

# Step 3: Pearson Correlation Coefficient
# Measures linear correlation between X and Y.
pearson_corr, pearson_p_value = pearsonr(X, Y)
print(f"Pearson Correlation Coefficient: {pearson_corr:.4f}, p-value: {pearson_p_value:.4f}")

# Step 4: Spearman's Rank Correlation
# Measures monotonic relationship between variables.
spearman_corr, spearman_p_value = spearmanr(X, Y)
print(f"Spearman's Rank Correlation: {spearman_corr:.4f}, p-value: {spearman_p_value:.4f}")

# Step 5: Kendall's Tau Rank Correlation
# Measures ordinal association between variables.
kendall_corr, kendall_p_value = kendalltau(X, Y)
print(f"Kendall's Tau Correlation: {kendall_corr:.4f}, p-value: {kendall_p_value:.4f}")

# Step 6: Point Biserial Correlation
# Requires one continuous variable and one binary variable.
# Simulate a binary variable from X.
Z = (X > np.median(X)).astype(int)  # Binary variable based on X's median
point_biserial_corr, point_biserial_p_value = pointbiserialr(Z, Y)
print(f"Point Biserial Correlation: {point_biserial_corr:.4f}, p-value: {point_biserial_p_value:.4f}")


### Correlation

For this analysis, we can **determine the relationship between any two numerical features** by calculating the **correlation coefficient**. 
- Correlation is a measure of the degree to which two variables change together, if at all. 
    - If two features have a strong positive correlation, it means that if the value of one feature increases, the value of the other feature also increases. 
    - There are three different correlation measures:
        - Pearson correlation 
        - Spearman rank correlation
        - Kendall correlation

For this lesson, we will focus on the **Pearson correlation**. The Pearson correlation measures the linear relationship between features and assumes that the features are normally distributed. Below is a table that explains how to interpret the Pearson correlation measure:

| Pearson Correlation Coefficient (r) | Description of Relationship |
|:------------------------------------|:---------------------------:|
| r = -1 | Perfect Negative Correlation |
| -1 < r ≤ -0.8 | Strong Negative Correlation |
| -0.8 < r ≤ -0.5 | Moderate Negative Correlation |
| -0.5 < r < 0 | Weak Negative Correlation |
| r = 0 | No Linear Correlation |
| 0 < r < 0.5 | Weak Positive Correlation |
| 0.5 ≤ r < 0.8 | Moderate Positive Correlation |
| 0.8 ≤ r < 1 | Strong Positive Correlation |
| r = 1 | Perfect Positive Correlation |
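
As a rough sketch, the table above can be turned into a small helper that labels a coefficient. The function name `describe_pearson` is hypothetical, and assigning values that fall exactly on 0.5 or 0.8 to the stronger category is an assumption made here for illustration.

```python
def describe_pearson(r):
    """Map a Pearson coefficient to the labels used in the interpretation table."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")
    if r == 0:
        return "No Linear Correlation"
    direction = "Positive" if r > 0 else "Negative"
    size = abs(r)
    if size == 1:
        strength = "Perfect"
    elif size >= 0.8:      # assumption: boundary values go to the stronger label
        strength = "Strong"
    elif size >= 0.5:
        strength = "Moderate"
    else:
        strength = "Weak"
    return f"{strength} {direction} Correlation"


for r in (-0.9, -0.3, 0.0, 0.65, 1.0):
    print(r, "->", describe_pearson(r))
```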


<div align="left" style="width: 800px; text-align: left;">
<img src="https://github.com/Explore-AI/Pictures/blob/f3aeedd2c056ddd233301c7186063618c1041140/regression_analysis_notebook/pearson_corr.jpg?raw=True"
     alt="Pearson Correlation"
     style="padding-bottom=0.5em"
     width=800px/>
</div>

For a more detailed explanation of correlations, read <a href="https://medium.com/fintechexplained/did-you-know-the-importance-of-finding-correlations-in-data-science-1fa3943debc2#:~:text=Correlation%20is%20a%20statistical%20measure,to%20forecast%20our%20target%20variable.&text=It%20means%20that%20when%20the,variable(s)%20also%20increases.">here</a>.

The command we will use to determine the correlation between features is:

    - df.corr()

In [None]:
df.corr()

# Multivariate Analysis: Graphical

For the multivariate graphical analysis the following visualisations will be considered:

    - Heatmap
    - Scatter Plot
    - Pair Plot
    - Joint Plot
    - Bubble Plot
    
#### Heatmap

The relationship between features can also be displayed graphically using a **heatmap**. The Seaborn library will be used for this basic heatmap visualisation. 

To see how different heatmap variations can be created, read <a href="https://medium.com/@szabo.bibor/how-to-create-a-seaborn-correlation-heatmap-in-python-834c0686b88e">here</a>.

Setting `annot=True` displays the correlation coefficient value in each cell of the heatmap, while the `vmin` and `vmax` parameters fix the colour scale to the range -1 to 1.

In [None]:
heatmap = sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12)

#### Scatter Plot

A Scatter plot is used to visualise the relationship between two different features and is most likely the primary multivariate graphical method. For this exercise, we will create a scatter plot to determine if there is a relationship between `bmi` and `age`. The parameter `hue` is set to the feature `insurance_claim`, colouring the points according to whether or not a claim was submitted.

In [None]:
sns.scatterplot(x='age',y='bmi',hue='insurance_claim', data=df)

#### Pair Plot

A pair plot can be used to visualise the relationships between all the numerical features at the same time. 

The `hue` is once again set to the feature `insurance_claim` to indicate which data points submitted an insurance claim and which didn't.

In [None]:
sns.set_style("whitegrid")
sns.pairplot(df, hue="insurance_claim")
plt.show()

#### Joint Plot

The joint plot can be used to provide univariate and multivariate analyses at the same time. The central part of the plot will be a scatter plot comparing two different features. The top and right visualisations will display the distribution of each feature as a histogram. 

For this joint plot, we will once again compare `age` and `bmi`.

In [None]:
sns.jointplot(x = 'age', y = 'bmi', data = df)

# including the hue as insurance_claim
sns.jointplot(x = 'age', y = 'bmi', data = df, hue='insurance_claim')

#### Bubble Plots

A bubble plot is a variation of a scatter plot in which the bubbles vary in size, dependent on another feature in the data. The colour of the bubbles can likewise be set to vary with the values of a further feature. This way, we can visualise up to four dimensions/features at the same time.

For this bubble plot, `bmi` and `claim_amount` will be plotted on the x-axis and y-axis, respectively. The colours of the bubbles will vary based on whether the observation is a `smoker` or not, and lastly, the size of the bubbles will vary based on the number of `children` the observation has. We will create this bubble plot by using `seaborn`’s scatter plot.

In [None]:
plt.figure(figsize=(12,8))
sns.scatterplot(x="bmi", 
                y="claim_amount",
                size="children",
                sizes=(20,100),
                alpha=0.8,
                hue="smoker",
                data=df)

## Splitting the Data
### Two-Way Split

When fitting a machine learning model to some data, we ultimately intend to use that model to make predictions/forecasts on real-world data. 
- Real-world data is unseen - it doesn't exist in the dataset we have at our disposal - so in order to validate our model (check how well it performs), we need to test it on unseen data too.
- Gathering unseen data is not as simple as collecting it from outside the window and exposing it to the model: any new data would need to be 
    - cleaned, 
    - wrangled and 
    - annotated just like the data in our dataset.
- The next best thing, then, is to simulate some unseen data, which we can do using the existing dataset by splitting it into two sets:
    - One for training the model; and
    - A second for testing it.
   
We fit a model using the training data, and then assess its accuracy using the test set. A typical choice is:
- use 80% of the data for training, so the training set contains 80% of the rows (data points); and
- keep 20% aside for testing, so the remaining 20% of rows make up the test set.

These rows are selected at random, to ensure that the mix of data in the train set is as close as possible to the mix in the test set.

### Three-Way Split

Many academic works on machine learning talk about splitting the dataset into three distinct parts: 
- `train`: the training set is used to fit the model to the observations.
- `validation`: during the model tuning process, when hyperparameters are tweaked and decisions about the dataset are made, the validation set is used to test the performance of the model.
- `test`: once the model designer is satisfied with the performance of the model on the validation set, the previously unseen test set is brought out and used to provide an unbiased evaluation of the final model fit on the training dataset.
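
One way to obtain three sets is to call `train_test_split` twice. The sketch below uses randomly generated data and a 60/20/20 split; both are assumptions chosen for illustration rather than a recommendation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows with 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# First split: hold out 20% of all rows as the final test set
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=50
)

# Second split: 25% of the remaining 80% gives a validation set
# that is 20% of the original data
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=50
)

print(len(X_tr), len(X_val), len(X_te))
```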

#### Caveats for using a validation set

On small datasets, it may not be feasible to include a validation set for the following reasons, both of which should be intuitive:

- The model may need every possible data point to adequately determine model values;
- For small enough test sets, the uncertainty in the test score can be so large that different test sets may produce very different results.

Clearly, further splitting the training data into training and validation sets would remove precious observations for the training process.

### Cross-Validation

If the designer does not want to use a separate validation set, or there is simply not enough data, a technique known as cross-validation may be used. 

A common version of cross-validation is known as K-fold cross-validation: 
- the training data is divided into K parts (folds) of roughly equal size;
- the model is trained K times, each time holding back one fold (10% of the training data when K = 10) and using it as a validation set while the model parameters are calculated on the remaining folds;
- the K validation scores are then averaged to estimate how well the model performs.
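
A minimal sketch of K-fold cross-validation with scikit-learn follows. The simulated data, the choice of `LinearRegression`, and K = 5 are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical data with a known linear relationship plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

# Five folds: each fold is used once for validation and four times for training
kfold = KFold(n_splits=5, shuffle=True, random_state=50)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kfold, scoring="r2")

print("R^2 per fold:", np.round(scores, 3))
print(f"Mean R^2: {scores.mean():.3f}")
```

Averaging over the folds gives a steadier performance estimate than a single validation split, at the cost of fitting the model K times.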

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rc
import seaborn as sns

# Import the split function from sklearn
from sklearn.model_selection import train_test_split

In [None]:
# Split the dataset into the response, y, and features, X
y = df['ZAR/USD']
X = df.drop('ZAR/USD', axis=1)

Understand the four parameters to hand to the splitting function.

- `X` contains the features on which we will be training the model. In this case: just `exports`;
- `y` is the response variable, that which we are trying to predict. In this case: `exchange rate`;
- `test_size` is a value between 0 and 1: the proportion of our dataset that we want to be used as test data. Typically 0.2 (20%);
- `random_state` is an arbitrary value which, when set, ensures that the _random_ nature in which rows are picked to be in the test set is the same each time the split is carried out. In other words, the rows are picked at random, but we can ensure these random picks are repeatable by using the same value here. This makes it easier to assess model performance across iterations.

In [None]:
#  Call the train_test_split function:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)

Plotting the data points in each of the training and testing sets in different colours, we should be able to see that we have a similar spread of data in each.

In [None]:
# Plot the results
plt.scatter(X_train, y_train, color='green', label='Training')  # plot the training data in green
plt.scatter(X_test, y_test, color='darkblue', label='Testing')  # plot the testing data in blue
plt.legend()
plt.show()

## Advanced plotting
Let's try and create something a little more visually appealing than the two plots above.

- We'll plot both dependent data series on the same graph;
- We'll assign two separate y-axes: one for each series;
- We'll display a legend near the top of the plot.

In [None]:
rc('mathtext', default='regular')
# Create blank figure
fig = plt.figure()

# Split figure to allow two sets of y axes
ax = fig.add_subplot(111)

# Plot the first line on its axis
ax.plot(np.arange(len(df.Y)), df.Y, '-', label = 'ZAR/USD', color='orange')

# Create second y axis and plot second line
ax2 = ax.twinx()
ax2.plot(np.arange(len(df.X)), df.X, '-', label = 'Exports (ZAR)')

# Add legends for each axis
ax.legend(loc=2)
ax2.legend(loc=9)

ax.grid()

# Set labels of axes
ax.set_xlabel("Months")
ax.set_ylabel("ZAR/USD")
ax2.set_ylabel("Exports (ZAR, millions)")
plt.show()

### Step 3: Select the Type of Statistical Model
Statistical models can be broadly categorized as:

- **Descriptive Models**: Summarize data patterns.
- **Inferential Models**: Help make inferences about the population.
- **Predictive Models**: Used to predict future outcomes based on historical data.
- **Prescriptive Models**: Suggest actions based on predictions.

Let's go through common types of statistical models and their applications.

# Regression Analysis
 
Regression Analysis is a statistical method to analyze the relationship between a dependent variable and one or more independent variables.

Regression analysis is used for one of two purposes: 
- to predict the value of the dependent variable when the independent variables are known; or 
- to estimate the effect of an independent variable on the dependent variable.

### Types of regression analysis

- **Simple linear regression**
    - Assumes a linear connection between a dependent variable (Y) and a single independent variable (X).
    - This is the simplest linear regression model, with only one dependent and one independent variable.
    - Example: a real estate agent wants to determine the relationship between the size of a house (in square feet) and its selling price. They can use simple linear regression to predict the selling price of a house based on its size.
    
- **Multiple Linear Regression / Multivariate Linear Regression**
    - Assumes a linear connection between a dependent variable (Y) and several independent variables (X).
    - Multiple linear regression has one dependent variable and more than one independent variable.
    - Multivariate linear regression is more complex still, with numerous dependent and independent variables.
    - Example: a car manufacturer wants to predict the fuel efficiency of their vehicles based on various independent variables such as engine size, horsepower, and weight.
    
- **Logistic regression**
    - Used when the dependent variable is discrete.
        - In the binary case, the target variable can take on only one of two values.
    - A sigmoid curve links the independent variables to the predicted probability, which always has a value between 0 and 1.
    - A bank wants to predict whether a customer will default on their loan based on their credit score, income, and other factors. By using logistic regression, the bank can estimate the probability of default and take appropriate measures to minimize their risk.

- **Polynomial Regression**
    - Represents a non-linear relationship between dependent and independent variables. 
    - This technique is a variant of the multiple linear regression model, but the best fit line is curved rather than straight.

- **Ridge Regression**
    - Applied when the independent variables are highly correlated, that is, when the data exhibits multicollinearity.
    - Least squares estimates remain unbiased under multicollinearity, but their variances are large, so an estimated coefficient can lie far from its true value. 
    - Ridge regression reduces standard errors by adding a degree of bias to the regression estimates.
    - The penalty parameter lambda (λ) in the ridge regression equation controls how strongly the coefficients are shrunk, which mitigates the multicollinearity problem.

- **Lasso Regression**
    - The lasso (Least Absolute Shrinkage and Selection Operator) technique penalizes the absolute magnitude of the regression coefficients. 
    - This penalty can shrink some coefficient values to exactly zero, so lasso regression also performs variable selection.

- **Quantile Regression**
    - Quantile regression is an extension of linear regression that models a conditional quantile of the dependent variable (such as the median) rather than its mean. 
    - Statisticians and econometricians employ quantile regression when linear regression requirements are not met or when the data contains outliers.

- **Bayesian Linear Regression**
    - Bayesian linear regression is a form of regression analysis, also used in machine learning, that estimates the regression coefficients using Bayes' theorem. 
    - Rather than computing a single least-squares estimate, this technique determines the posterior distribution of the coefficients.
    - The approach can be more stable than ordinary linear regression. 

- **Principal Components Regression**
    - Multicollinear regression data is often analyzed using the principal components regression approach. 
    - Principal components regression, like ridge regression, reduces standard errors by biasing the regression estimates. 
    - First, principal component analysis (PCA) transforms the training data, and then the regressors are trained on the resulting transformed samples.

- **Partial Least Squares Regression**
    - The partial least squares regression technique is a fast and efficient covariance-based regression analysis technique. 
    - It is advantageous for regression problems with many independent variables with a high probability of multicollinearity between the variables. 
    - The method reduces the number of variables to a manageable number of predictors, then uses them in regression.

- **Elastic Net Regression**
    - Elastic net regression combines ridge and lasso regression techniques that are particularly useful when dealing with strongly correlated data. 
    - It regularizes regression models by utilizing the penalties associated with the ridge and lasso regression methods.
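
To make the regularized variants more concrete, the sketch below fits ordinary least squares, ridge, lasso, and elastic net to simulated data in which two predictors are almost identical. The data, the penalty strengths (`alpha`), and the `l1_ratio` are assumptions picked for illustration, not tuned values.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
n_rows = 200

# x1 and x2 are almost identical (multicollinearity); x3 is unrelated noise
x1 = rng.normal(size=n_rows)
x2 = x1 + rng.normal(scale=0.01, size=n_rows)
x3 = rng.normal(size=n_rows)
X_demo = np.column_stack([x1, x2, x3])
y_demo = 3.0 * x1 + rng.normal(scale=0.5, size=n_rows)

estimators = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
}

# Fit each estimator and keep its coefficients for comparison
coefs = {}
for name, estimator in estimators.items():
    estimator.fit(X_demo, y_demo)
    coefs[name] = estimator.coef_
    print(f"{name:<10} {np.round(estimator.coef_, 3)}")
```

With near-duplicate predictors, the unpenalized coefficients for `x1` and `x2` tend to be unstable, while the penalized models keep them smaller; lasso and elastic net may also set some coefficients exactly to zero.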


### Complete Workflow for Regression Modeling
Steps of a regression modeling process, covering:
- Exploratory Data Analysis (EDA), 
- assumption checking, 
- data transformations, 
- model fitting, and 
- interpretation.

**Step 1: Problem Definition and Data Understanding**

1. Define the Problem:
- Identify the dependent (response) variable and independent (predictor) variables.
- Clarify objectives
    - prediction, 
    - inference,
    - explanation.
2. Understand the Data:
- Review the dataset's structure, variable types, and context.

In [None]:
# missing values
def missing_values_table(df):
        mis_val = df.isnull().sum()
        
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

train.head()
train.info()
train.shape

missing_values_table(train)

**Step 2: Exploratory Data Analysis (EDA)**

1. Summary Statistics:
Compute 
- mean, 
- median, 
- standard deviation, and 
- correlations.

In [None]:
print(df.describe())
print(df.corr())

2. Visualization:
- Histogram for distributions.
- Scatter plots for relationships.
- Box plots for detecting outliers.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, diag_kind="kde")
plt.show()


3. Check Multicollinearity:

- Compute the Variance Inflation Factor (VIF).

In [None]:
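import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Add a constant so that each VIF is computed for a model with an intercept
X_vif = sm.add_constant(df[["SquareFootage", "Bedrooms", "LocationIndex"]])
vif = pd.DataFrame()
vif["Features"] = X_vif.columns
vif["VIF"] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
print(vif[vif["Features"] != "const"])  # the VIF of the constant itself is not of interest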


**Step 3: Preprocessing and Transformations**

1. Handle Missing Data:
- Impute missing values or drop rows/columns.

In [None]:
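# Fill missing numeric values with each column's median; non-numeric columns are left unchanged
df = df.fillna(df.median(numeric_only=True))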

2. Encode Categorical Variables:
- Use one-hot encoding or label encoding.

In [None]:
df = pd.get_dummies(df, columns=["Location"], drop_first=True)

3. Feature Scaling:
- Standardize or normalize numerical variables.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[["SquareFootage", "Bedrooms"]])

4. Transform Non-linear Relationships:
- Apply log, Box-Cox, or square root transformations for skewed variables.

In [None]:
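import numpy as np
from scipy.stats import boxcox

# Both transformations require strictly positive values
df["LogPrice"] = np.log(df["Price"])
df["BoxCoxPrice"], _ = boxcox(df["Price"])  # boxcox also returns the fitted lambda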

#### Common Data Cleaning Issues and How to Handle Them
Data cleaning is a critical step in data preprocessing to ensure that datasets are accurate, consistent, and reliable before analysis or modeling. 

Data cleaning is an iterative process. It ensures the dataset is:
- ✅ Complete (No missing values)
- ✅ Accurate (No typos or incorrect entries)
- ✅ Consistent (Standardized formats & categories)
- ✅ Reliable (No data leakage or inconsistencies)

##### **Missing Data**
📌 Issue: Some values in the dataset are missing, which can affect analysis and machine learning models.

🔹 Causes:
- Data entry errors
- Sensor failures
- Non-response in surveys

🔹 Solutions:
- Remove rows/columns with excessive missing values (if the proportion is too high).
- Impute missing values using:
    - Mean/Median (for numerical data)
    - Mode (for categorical data)
    - Forward/Backward Fill (for time-series data)
    - KNN or Regression Imputation (for advanced cases)

In [None]:
import pandas as pd
from sklearn.impute import SimpleImputer

# Sample data
data = {'Age': [25, 30, None, 40, 35], 'Salary': [50000, None, 60000, 65000, None]}
df = pd.DataFrame(data)

# Impute missing values with mean
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])

print(df)


##### **Duplicate Records**
📌 Issue: Duplicate rows in the dataset can distort results and inflate counts.

🔹 Causes:
- Data entry errors
- Merging multiple datasets
- Web scraping issues

🔹 Solutions:
- Identify and remove exact duplicates using .drop_duplicates().
- Use domain knowledge to define duplicates when small variations exist.

In [None]:
df = df.drop_duplicates()

##### **Inconsistent Data Formats**
📌 Issue: Data is stored in different formats (e.g., date formats like MM-DD-YYYY vs. YYYY-MM-DD).

🔹 Causes:
- Different data sources
- Human input errors

🔹 Solutions:
- Standardize formats (e.g., convert all dates to YYYY-MM-DD).
- Use regex or string functions to clean inconsistencies.

In [None]:
df['date'] = pd.to_datetime(df['date'], errors='coerce')

##### **Outliers and Anomalies**
📌 Issue: Some values are significantly different from the rest, which can distort statistical analysis.

🔹 Causes:
- Data entry errors
- Genuine extreme events

🔹 Solutions:
- Detect outliers using:
    - Z-score (values beyond ±3 standard deviations)
    - IQR (Interquartile Range) method
- Handle outliers by:
    - Capping values at a reasonable threshold
    - Using robust statistical models

In [None]:
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

df = df[(df['Age'] >= (Q1 - 1.5 * IQR)) & (df['Age'] <= (Q3 + 1.5 * IQR))]
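
The Z-score rule and the capping option listed above can be sketched as follows. The simulated ages and the single data-entry error (420) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: 200 typical values plus one data-entry error
rng = np.random.default_rng(0)
ages = pd.Series(np.append(rng.normal(loc=35, scale=8, size=200), 420.0), name="Age")

# Z-score detection: flag values more than 3 standard deviations from the mean
z_scores = (ages - ages.mean()) / ages.std()
outliers = ages[z_scores.abs() > 3]
print("Flagged as outliers:", outliers.tolist())

# Capping: clip values to the IQR fences instead of dropping the rows
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
ages_capped = ages.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
print("Maximum after capping:", round(ages_capped.max(), 1))
```

Capping keeps every row, which can matter on small datasets, whereas filtering removes the affected observations entirely.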


##### **Incorrect or Inconsistent Categorical Values**
📌 Issue: Categorical data may have misspellings or inconsistent labels ("Male", "male", "M")

🔹 Causes:
- Human entry errors
- Multiple data sources

🔹 Solutions:
- Standardize text formats (e.g., lowercase everything).
- Use fuzzy matching for minor spelling errors.
- Manually map inconsistent categories to standardized ones.

In [None]:
df['Gender'] = df['Gender'].str.lower().replace({'m': 'male', 'f': 'female'})
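
For minor spelling errors, fuzzy matching can be sketched with the standard-library `difflib` module. The helper name `standardise`, the list of accepted labels, and the similarity cutoff of 0.7 are assumptions for illustration; dedicated fuzzy-matching libraries may suit larger vocabularies better.

```python
import difflib

valid_labels = ["male", "female"]  # hypothetical set of accepted categories
raw_values = ["Male", "femlae", "FEMALE ", "mael", "unknown"]


def standardise(label, choices=valid_labels, cutoff=0.7):
    """Return the closest accepted label, or None if nothing is similar enough."""
    cleaned = label.strip().lower()
    matches = difflib.get_close_matches(cleaned, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


cleaned_values = [standardise(value) for value in raw_values]
print(cleaned_values)
```

Values that match nothing are returned as `None`, so they can be reviewed manually rather than being silently assigned to a category.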

##### **Structural Errors (Typos, Extra Spaces, Incorrect Encoding)**
📌 Issue: Inconsistent spelling or encoding errors ("New York " vs. "NewYork")

🔹 Causes:
- Data entry errors
- Copy-pasting issues

🔹 Solutions:
- Remove extra spaces using .strip().
- Fix encoding issues with .encode().decode().

In [None]:
df['City'] = df['City'].str.strip()

##### **Data Leakage**
📌 Issue: The dataset contains information that should not be available during model training.

🔹 Causes:
- Using future data in training
- Including target-related variables as predictors

🔹 Solutions:
- Remove features that leak information (e.g., future sales data in a forecasting model).
- Ensure a proper train-test split to prevent future data from influencing training.
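
A common source of leakage is fitting a preprocessing step, such as a scaler, on the full dataset before splitting. The sketch below shows one way to avoid this with a scikit-learn pipeline; the simulated data and the choice of model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with two predictors
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(200, 2))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=1.0, size=200)

# Split first, so the test rows play no part in any fitting step
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=50)

# The pipeline fits the scaler on the training rows only, then reuses
# those training statistics when it transforms the test rows
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_tr, y_tr)

fitted_scaler = pipeline.named_steps["standardscaler"]
print("Scaler means (training rows only):", np.round(fitted_scaler.mean_, 2))
print("Test R^2:", round(pipeline.score(X_te, y_te), 3))
```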

##### **Data Type Mismatch**
📌 Issue: Columns have incorrect data types (e.g., numeric values stored as strings).

🔹 Causes:
- Importing data from CSVs or Excel
- Mixed data types in a column

🔹 Solutions:
- Convert data types explicitly using .astype().
- Handle conversion errors with errors='coerce'

In [None]:
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')

##### **Imbalanced Data**
📌 Issue: In classification tasks, one class dominates (90% fraud-free, 10% fraud cases).

🔹 Causes:
- Rare event classification (e.g., fraud detection)

🔹 Solutions:
- Oversample the minority class (e.g., SMOTE technique).
- Undersample the majority class to balance the dataset.
- Use weighted models that adjust for class imbalance.

In [None]:
from imblearn.over_sampling import SMOTE
X_resampled, y_resampled = SMOTE().fit_resample(X, y)
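
The weighted-model option can be sketched with the `class_weight` parameter of scikit-learn's `LogisticRegression`. The simulated 90/10 data is an assumption, and recall is measured on the training data purely to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)

# Hypothetical data: 900 majority-class rows and 100 minority-class rows
X_major = rng.normal(loc=0.0, scale=1.0, size=(900, 2))
X_minor = rng.normal(loc=1.5, scale=1.0, size=(100, 2))
X_demo = np.vstack([X_major, X_minor])
y_demo = np.array([0] * 900 + [1] * 100)

# Same model, with and without weights inversely proportional to class frequency
plain = LogisticRegression().fit(X_demo, y_demo)
weighted = LogisticRegression(class_weight="balanced").fit(X_demo, y_demo)

recall_plain = recall_score(y_demo, plain.predict(X_demo))
recall_weighted = recall_score(y_demo, weighted.predict(X_demo))
print(f"Minority recall without weights: {recall_plain:.2f}")
print(f"Minority recall with class_weight='balanced': {recall_weighted:.2f}")
```

Weighting usually raises recall on the minority class at the cost of more false positives, so the trade-off should be judged against the application.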

**Step 4: Model Fitting and Assumption Checking**
1. Fit the Regression Model:

In [None]:
import statsmodels.api as sm

X = sm.add_constant(df[["SquareFootage", "Bedrooms"]])  # Add intercept
Y = df["Price"]
model = sm.OLS(Y, X).fit()
print(model.summary())

2. Check Model Assumptions:

(a) Linearity: 

Residuals vs. Fitted Plot: Look for randomness.

In [None]:
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="red", linestyle="--")
plt.title("Residuals vs. Fitted")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

(b) Normality of Residuals:

Use a histogram and Q-Q plot.

In [None]:
import scipy.stats as stats

stats.probplot(model.resid, dist="norm", plot=plt)
plt.title("Q-Q Plot")
plt.show()

(c) Homoscedasticity:

Breusch-Pagan test.

In [None]:
from statsmodels.stats.diagnostic import het_breuschpagan

bp_test = het_breuschpagan(model.resid, X)
print(f"p-value: {bp_test[1]}")

(d) Multicollinearity:

Variance Inflation Factor (VIF) as shown earlier.

In [None]:
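import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Use a separate name so the design matrix X of the fitted model is not overwritten
X_vif = sm.add_constant(df[["SquareFootage", "Bedrooms", "LocationIndex"]])
vif = pd.DataFrame()
vif["Features"] = X_vif.columns
vif["VIF"] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
print(vif[vif["Features"] != "const"])  # the VIF of the constant itself is not of interest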

**Step 5: Address Issues and Refine the Model**

1. Linearity:

Transform variables if residual plots show non-linearity.

In [None]:
df["SquareFootage_Sq"] = df["SquareFootage"] ** 2

2. Non-Normal Residuals:

Apply transformations to the dependent variable.

In [None]:
df["LogPrice"] = np.log(df["Price"])
model_log = sm.OLS(df["LogPrice"], X).fit()

3. Heteroscedasticity:

Use Weighted Least Squares (WLS).

In [None]:
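# Model the error variance from the predictors instead of using raw squared residuals as weights
variance_fit = sm.OLS(np.log(model.resid ** 2), X).fit()
weights = 1 / np.exp(variance_fit.fittedvalues)
model_wls = sm.WLS(Y, X, weights=weights).fit()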

4. Multicollinearity:
- Drop or combine highly correlated variables.
- Use PCA, Ridge, or Lasso regression.

In [None]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)
ridge.fit(X, Y)

**Step 6: Evaluate Model Performance**
- Metrics: 
    - $R^2$
    - Adjusted $R^2$
    - RMSE, 
    - MAE.

- Residual Plots:
    - Confirm residuals are normally distributed and homoscedastic.

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

rmse = np.sqrt(mean_squared_error(Y, model.predict(X)))
mae = mean_absolute_error(Y, model.predict(X))
print(f"RMSE: {rmse}, MAE: {mae}")
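
$R^2$ and adjusted $R^2$ can be computed by hand as a check. The helper `adjusted_r2` and the observed and predicted values below are hypothetical; the adjustment uses the standard formula $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors.

```python
import numpy as np


def adjusted_r2(y_true, y_pred, n_predictors):
    """Return (R^2, adjusted R^2) for the given observations and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n_obs = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)
    return r2, adj


# Hypothetical observed and predicted values from a model with 2 predictors
y_obs = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]
y_hat = [11.0, 12.0, 14.0, 20.0, 23.0, 30.0]
r2, adj_r2 = adjusted_r2(y_obs, y_hat, n_predictors=2)
print(f"R^2: {r2:.4f}, Adjusted R^2: {adj_r2:.4f}")
```

A fitted statsmodels OLS result also exposes these quantities as `rsquared` and `rsquared_adj`.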

**Step 7: Interpretation and Communication**

Coefficient Interpretation:
- For each predictor, interpret its coefficient in terms of the dependent variable.

Confidence Intervals:
- Report 95% confidence intervals for coefficients.
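
The 95% confidence intervals can be derived from the coefficient estimates and their standard errors. The sketch below does this with NumPy and SciPy on simulated data; the variable names and the simulated relationship are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_obs = 100
square_footage = rng.uniform(500, 3000, size=n_obs)  # hypothetical predictor
price = 50_000 + 120 * square_footage + rng.normal(scale=20_000, size=n_obs)

# Design matrix with an intercept column, then least-squares estimates
design = np.column_stack([np.ones(n_obs), square_footage])
beta, *_ = np.linalg.lstsq(design, price, rcond=None)

# Standard errors from the residual variance and (X'X)^-1
residuals = price - design @ beta
dof = n_obs - design.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))

# 95% interval: estimate +/- t critical value * standard error
t_crit = stats.t.ppf(0.975, dof)
ci_lower = beta - t_crit * std_err
ci_upper = beta + t_crit * std_err
for name, b, lo, hi in zip(["Intercept", "SquareFootage"], beta, ci_lower, ci_upper):
    print(f"{name:<14} {b:12.2f}  [{lo:12.2f}, {hi:12.2f}]")
```

For a model fitted with statsmodels, `model.conf_int(alpha=0.05)` reports intervals of the same kind without the manual steps.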

Visualize Results:

In [None]:
import seaborn as sns

sns.regplot(x="SquareFootage", y="Price", data=df, line_kws={"color": "red"})
plt.title("Regression Line: Square Footage vs. Price")
plt.show()

# Linear Regression

Linear Regression is a supervised learning algorithm used to model the relationship between a dependent variable (outcome) and one or more independent variables (predictors).
- Predicts the relationship between two variables by assuming they have a straight-line connection. 

Linear Regression predicts a continuous target variable (e.g., the number of readmissions) by minimizing the residual sum of squares between observed and predicted values.
- It finds the best line that minimizes the differences between predicted and actual values.

## 1. Simple Linear Regression

In a simple linear regression, there is 
- one independent variable and 
- one dependent variable. 

The model estimates the slope and intercept of the line of best fit, which represents the relationship between the variables. 
- The slope represents the change in the dependent variable for each unit change in the independent variable, while 
- The intercept represents the predicted value of the dependent variable when the independent variable is zero.
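
For a single predictor, the least-squares slope and intercept have a closed form: the slope is Cov(x, y) / Var(x), and the intercept is the mean of y minus the slope times the mean of x. The house sizes and prices below are made-up values for illustration.

```python
import numpy as np

# Hypothetical data: house size (square feet) and selling price (in thousands)
size = np.array([800, 1000, 1200, 1500, 1800, 2000], dtype=float)
price = np.array([150, 180, 210, 260, 300, 330], dtype=float)

# Closed-form least-squares estimates for one predictor
slope = np.cov(size, price)[0, 1] / np.var(size, ddof=1)
intercept = price.mean() - slope * size.mean()

print(f"slope = {slope:.4f} (change in price per extra square foot)")
print(f"intercept = {intercept:.4f} (predicted price when size is zero)")
print(f"predicted price for 1,600 sq ft: {intercept + slope * 1600:.1f}")
```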

What It Means: 
- Linear regression models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. 
- It assumes a straight-line relationship between the independent (predictor) variable, plotted on the X-axis, and the dependent (output) variable, plotted on the Y-axis.
- In the simple case, it is employed to establish a link between a dependent variable and a single independent variable. 
    - A linear equation defines the relationship, with the slope and intercept of the line representing the effect of the independent variable on the dependent variable.
    - An independent variable is the variable that is controlled in a scientific experiment to test the effects on the dependent variable.
    - A dependent variable is the variable being measured in a scientific experiment.

Outcome Interpretation: 
- Each coefficient represents how much the dependent variable (outcome) changes when the predictor variable changes by one unit, keeping all else constant.

**Assumptions of Linear Regression**

Regression is a parametric approach, which means that it makes assumptions about the data.

For successful regression analysis, it’s essential to validate the following assumptions.

- Linearity (Linear Relationship): The relationship between the predictors and the outcome is linear.
    - Plot the dependent variable against the independent variable(s) and check for a linear relationship.
- Independence of Errors (No or Little Autocorrelation): Residuals (errors) are independent of each other.
    - The error terms should not be dependent on one another (as in time-series data, where the next value depends on the previous one). 
    - There should be no correlation between the residual terms.
    - Correlation between residual terms is known as autocorrelation, so this assumption requires its absence.
- Normality of Errors (Multivariate Normality): Residuals are normally distributed.
    - The residuals should follow a normal distribution with a mean equal to zero or close to zero. 
    - Checking this helps to assess whether the selected line is the line of best fit. 
    - If the error terms are not normally distributed, this suggests that there are a few unusual data points that must be studied closely to make a better model.
- No or Little Multicollinearity: The independent variables are not highly correlated with each other.
- Homoscedasticity: Variance of residuals is constant across all levels of predictors.
    - The error terms must have constant variance. 
    - The presence of non-constant variance in the error terms is referred to as heteroscedasticity. 

Performance Measures:
- R-squared: Indicates the proportion of the variance in the dependent variable explained by the independent variables. 
    - Values closer to 1 indicate a better fit.
- Mean Squared Error (MSE): The average squared difference between observed and predicted values; lower values are better.

Lay Explanation: 
- Think of linear regression like drawing a best-fit line through a scatterplot of data points, aiming to predict outcomes based on relationships in the data.
- Finds a relationship between independent and dependent variables by finding a “best-fitted line” that has minimal distance from all the data points.
- The algorithm explains the linear relationship between the dependent(output) variable y and the independent(predictor) variable X using a straight line

Use Case: 
- When there is a linear relationship between the target and predictor variables.

### Mathematics of Linear Regression

- Linear regression uses the least squares method to find the linear equation that minimizes the sum of squared residuals (SSR).
- Cost Function:

$ J(\theta) = \frac{1}{2m}\sum^{m}_{i=1}(h_{\theta}(x^{(i)})- y^{(i)})^{2}$

Model Equation:
$ 𝑦=𝛽_{0}+𝛽_{1}𝑥_{1}+…+𝛽_{𝑛}𝑥_{𝑛}+ 𝜖 $

where:
- $y$ = dependent variable
- $𝛽_{0}$ = y-intercept (constant)
- $𝛽_{1}, …, 𝛽_{𝑛}$ = slope coefficients
- $𝑥_{1}, …, 𝑥_{𝑛}$ = independent variables
- $𝜖 $ = error term
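
To make the cost function $J(\theta)$ concrete, here is a minimal sketch for the single-predictor case, assuming the hypothesis $h_{\theta}(x) = \theta_0 + \theta_1 x$. The function name `cost` and the toy data are illustrative assumptions.

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta) = 1/(2m) * sum((h(x_i) - y_i)^2) for h(x) = theta0 + theta1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = len(x)
    predictions = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * m)

# Hypothetical data lying exactly on the line y = 2x
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

perfect = cost(0.0, 2.0, x, y)  # the line passes through every point
worse = cost(0.0, 1.0, x, y)    # slope too small, so the cost is higher
print(perfect, worse)
```

A lower value of `cost` means the candidate line is closer to the data, which is what the fitting procedure tries to achieve.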

**What is a Cost Function?**

The goal of the linear regression algorithm is to find the best values for $𝛽_{0}$ and $𝛽_{1}$, which define the **best-fit line**.
- The best-fit line is the line with the least error: the difference between predicted values and actual values should be as small as possible.

A cost function is also referred to as a:
- loss function: used when we refer to the error for a single training example.
- objective function: used to refer to an average of the loss functions over an entire training dataset.

It quantifies the difference between predicted and actual values, serving as a metric to evaluate the performance of a model.

Objective 
- is to minimize the cost function, indicating better alignment between predicted and observed outcomes.
- Guides the model towards optimal predictions by measuring its accuracy against the training data.

Random Error (Residuals)
- The difference between the observed value of the dependent variable ($y_{i}$) and the predicted value ($y_{predicted}$) is called the residual.
    - $𝜖_{i} = y_{i} - y_{predicted}$

where $𝑦_{predicted} = 𝛽_{0}+𝛽_{1}𝑥_{1}+…+𝛽_{𝑛}𝑥_{𝑛}$

**Why Use a Cost Function?**

The cost function helps us reach the optimal solution, i.e. work out the optimal values for $𝛽_{0}$ and $𝛽_{1}$.
- How: It takes both the outputs predicted by the model and the actual outputs, and calculates how wrong the model was in its predictions.
    - It measures the discrepancy between the model’s predictions and the true values it is attempting to predict.
    - This discrepancy is expressed as a single number, which lets us measure how well the model fits.
- The cost function is the technique for evaluating “the performance of our algorithm/model”.

Several classifiers can have very high accuracy, but one solution (classifier) is the best because it does not misclassify any point.
- It classifies all the points perfectly because the:
    - line lies almost exactly midway between the two (n) groups, and is not closer to any one of the groups.

Explanation of the function of a cost function:

- Error calculation: It determines the difference between the predicted outputs (what the model predicts as the answer) and the actual outputs (the true values we possess for the data).
- Gives one value: This simplifies comparing the model’s performance on various datasets or training rounds.
- Guides improvement: The objective is to reduce the cost function.
    - How: By modifying the internal parameters of the model, such as weights and biases, we can minimize the total error and improve the accuracy of the model.
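
One way to "modify the parameters to reduce the cost" is gradient descent. The block below is a minimal sketch for a single predictor, not necessarily the method used elsewhere in these notes. The function name, the learning rate, the iteration count and the toy data are all assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.05, n_iterations=5000):
    """Minimize J(theta) = 1/(2m) * sum((theta0 + theta1*x - y)^2) by gradient descent."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = len(x)
    theta0, theta1 = 0.0, 0.0  # start from an arbitrary guess
    for _ in range(n_iterations):
        error = (theta0 + theta1 * x) - y
        grad0 = np.sum(error) / m      # dJ/dtheta0
        grad1 = np.sum(error * x) / m  # dJ/dtheta1
        theta0 -= learning_rate * grad0
        theta1 -= learning_rate * grad1
    return theta0, theta1

# Small toy dataset
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])

theta0, theta1 = gradient_descent(x, y)
print(f"Intercept: {theta0:.3f}, Slope: {theta1:.3f}")
```

For this data the iterations settle on the same intercept and slope that the closed-form least-squares formulas give. The learning rate has to be small enough for the updates to converge.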

**Types of Cost function in machine learning**

The choice of cost function depends on whether the task is a regression problem or a classification problem.
- Regression cost Function
- Binary Classification cost Functions
- Multi-class Classification cost Functions

### Problem Context: Predicting Hospital Readmission Rates
The aim is to reduce hospital readmission rates.
- High readmission rates can strain resources and negatively impact patient outcomes.
- The goal is to predict the number of readmissions within 30 days of discharge for a particular condition, such as diabetes, based on
    - patient demographics,
    - clinical data, and
    - treatment data.

**Step 1. Define the Problem**

We want to predict the number of readmissions ($𝑌$) using features ($𝑋$) such as:
- Patient age
- Length of hospital stay
- Severity of condition
- Medication adherence rate
- Comorbidities (e.g., hypertension, kidney disease)
- Number of follow-up visits scheduled

**Step 2. Collect and Prepare Data**

- Data Collection: Gather historical patient data from the hospital's database.
- Understand the 
    - model description
    - causality and 
    - directionality
- Check the data
    - categorical data, 
    - missing data and 
    - outliers
- Data Cleaning: 
    - A dummy variable takes only the value 0 or 1, indicating the absence or presence of a category of a categorical variable.
    - Handle missing values, 
    - remove duplicates, and 
    - correct errors.
    - An outlier is a data point that differs significantly from other observations. To detect outliers,
        - use the standard deviation method or
        - the interquartile range (IQR) method.
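
The two outlier rules mentioned above can be sketched as follows. The function names, the default thresholds (`n_std`, `k`) and the `length_of_stay` values are hypothetical and chosen only for illustration.

```python
import numpy as np

def outliers_std(values, n_std=3.0):
    """Flag points more than n_std standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    return np.abs(values - values.mean()) > n_std * values.std()

def outliers_iqr(values, k=1.5):
    """Flag points below Q1 - k*IQR or above Q3 + k*IQR."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical lengths of stay in days; the last value is suspicious
length_of_stay = np.array([2, 3, 3, 4, 4, 5, 5, 6, 30])

iqr_flags = outliers_iqr(length_of_stay)
std_flags = outliers_std(length_of_stay, n_std=2.0)
print("IQR method:", length_of_stay[iqr_flags])
print("Std method:", length_of_stay[std_flags])
```

In this small sample the extreme value inflates the standard deviation so much that a 3-standard-deviation cutoff would not flag it. That is one reason the IQR rule is often preferred for small or skewed samples.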
- Feature Engineering: 
    - Encode categorical variables (e.g., age group), 
    - scale continuous variables (e.g., length of stay), and 
    - create interaction terms if necessary.
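
The feature engineering steps above can be sketched with pandas. The `patients` DataFrame and its column names are hypothetical. Dropping the first dummy level and standardizing with the population standard deviation are choices made here for illustration, not requirements.

```python
import pandas as pd

# Hypothetical patient records, for illustration only
patients = pd.DataFrame({
    "age_group": ["18-40", "41-65", "65+", "41-65"],
    "length_of_stay": [2.0, 4.0, 6.0, 8.0],
})

# Encode the categorical variable as 0/1 dummy variables.
# drop_first=True drops one level so the dummies are not perfectly collinear.
encoded = pd.get_dummies(patients, columns=["age_group"], drop_first=True, dtype=int)

# Scale the continuous variable to zero mean and unit standard deviation
stay = encoded["length_of_stay"]
encoded["length_of_stay_scaled"] = (stay - stay.mean()) / stay.std(ddof=0)

# Interaction term: lets the effect of length of stay differ for the 65+ group
encoded["stay_x_65plus"] = encoded["length_of_stay_scaled"] * encoded["age_group_65+"]

print(encoded)
```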

**Step 3. Conduct a Simple Analysis**
- Check the **effect** comparing between 
    - Dependent variable to independent variable and 
    - Independent variable to independent variable
- Check the correlation.
    - Use scatter plots
- Check Multicollinearity 
    - This occurs when two or more independent variables are highly correlated.
    - Use the Variance Inflation Factor (VIF):
        - if VIF > 5, the variables are highly correlated, and
        - if VIF > 10, there is almost certainly multicollinearity among the variables.
- An interaction term implies that the slope for one variable changes with the value of another variable.
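
The VIF of predictor $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors. The block below is a from-scratch sketch of that definition using NumPy. The function name and the three toy predictors are assumptions made for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X, computed as 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Hypothetical predictors: x3 is nearly a copy of x1, x2 is largely unrelated
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
x3 = x1 + np.array([0.2, -0.1, 0.0, 0.1, -0.2, 0.1])

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(np.round(vifs, 2))
```

Here `x1` and `x3` get very large VIFs because each is almost fully explained by the other, while `x2` stays close to 1.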

`Show the relationship between the two variables using a scatter plot.`
- We have our Y, our X, and time (months), but we're just trying to model ZAR/USD as a *function* of Exports. 
    - The scatter plot lets us see whether a linear relationship plausibly exists between the two variables: Value of Exports and ZAR/USD.

In [None]:
plt.scatter(df['X'], df['Y'])
plt.ylabel("ZAR/USD")
plt.xlabel("Value of Exports (ZAR, millions)")
plt.show()

**Step 4. Formulate the Model (From Scratch)**

The model is the straight line $y = mx + b$, where:
- y is the predicted value,
- x is the independent variable, and
- m and b are the **coefficients** (slope and intercept) we need to optimize in order to fit the regression line to our data.

#### Finding the Best Fit Line
Let's say we have estimated some values for $a$ and $b$, where the line is written as $\hat{y} = a + bx$ (so in this subsection $a$ is the intercept and $b$ is the slope). We could plug in all of our values of X to find the corresponding values of Y. These *new* values of Y could be compared to the *actual* values of Y to assess the fit of the line. This becomes tedious as the number of data points increases.
   
Looking at the data, we can make a guess at the values of the slope and intercept of the line. We'll use a rough estimate of the slope as $\frac{rise}{run} = \frac{16}{80000} = 0.0002$. For the intercept, we'll just take a guess and call it $-3$.   
   
Let's plot a line with values of $a = -3$, and $b = 0.0002$:   
   
First, we will need to generate some values of y using the following formula:
   
$$\hat{y}_i = a + bx_i$$   



Calculating the coefficients of the equation:
- To calculate the coefficients we need the formulas for covariance and variance.

Covariance

$Cov(X,Y) = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{n}$

Variance

$Var(X) = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})^2}{n}$

- To calculate the coefficients m (slope) and b (intercept):
    - m = cov(x, y) / var(x)
    - b = mean(y) - m * mean(x)

**Functions to calculate the Mean, Covariance, and Variance.**

In [None]:
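import numpy as np

# mean
def get_mean(arr):
    return np.sum(arr)/len(arr)

# variance (returns the sum of squared deviations; the 1/n factor is omitted
# because it cancels in the ratio cov(x, y) / var(x))
def get_variance(arr, mean):
    return np.sum((arr-mean)**2)

# covariance (returns the sum of cross-deviations; 1/n omitted for the same reason)
def get_covariance(arr_x, mean_x, arr_y, mean_y):
    final_arr = (arr_x - mean_x)*(arr_y - mean_y)
    return np.sum(final_arr)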

**Function to calculate the coefficients and the Linear Regression Function**

In [None]:
# Coefficients 
# m = cov(x, y) / var(x)
# b = mean(y) - m*mean(x)

def get_coefficients(x, y):
    x_mean = get_mean(x)
    y_mean = get_mean(y)
    m = get_covariance(x, x_mean, y, y_mean)/get_variance(x, x_mean)
    b = y_mean - x_mean*m
    return m, b

In [None]:
# Linear Regression 
# Train and Test
# Train Split 80 % Test Split 20 %
from sklearn.metrics import mean_squared_error, r2_score

def linear_regression(x_train, y_train, x_test, y_test):
    prediction = []
    m, b = get_coefficients(x_train, y_train)
    for x in x_test:
        y = m*x + b
        prediction.append(y)
    
    # sklearn metrics expect (y_true, y_pred); the order matters for R2
    r2 = r2_score(y_test, prediction)
    mse = mean_squared_error(y_test, prediction)
    print("The R2 score of the model is: ", r2)
    print("The MSE score of the model is: ", mse)
    return prediction

# First 80 observations for training, the remaining ones for testing
prediction = linear_regression(x[:80], y[:80], x[80:], y[80:])

In [None]:
# Define a function to generate values of y from a list of x, 
# Given parameters a and b

def gen_y(x_list, a, b):
    y_gen = []
    for x_i in x_list:
        y_i = a + b*x_i
        y_gen.append(y_i)
    
    return(y_gen)

# Generate the values by invoking the 'gen_y' function
y_gen = gen_y(df.X, -3, 0.0002)

# Plot the results
plt.scatter(df.X, df.Y)  # Plot the original data
plt.plot(df.X, y_gen, color='red')  # Plot the line connecting the generated y-values
plt.ylabel("ZAR/USD")
plt.xlabel("Value of Exports (ZAR, millions)")
plt.show()

**Visualize the regression line**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def plot_reg_line(x, y):
    # Calculate predictions for x ranging from 1 to 99
    prediction = []
    m, c = get_coefficients(x, y)
    for x0 in range(1,100):
        yhat = m*x0 + c
        prediction.append(yhat)
    
    # Scatter plot without regression line
    fig = plt.figure(figsize=(20,7))
    plt.subplot(1,2,1)
    sns.scatterplot(x=x, y=y)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Scatter Plot between X and Y')
    
    # Scatter plot with regression line
    plt.subplot(1,2,2)
    sns.scatterplot(x=x, y=y, color = 'blue')
    sns.lineplot(x = [i for i in range(1, 100)], y = prediction, color='red')
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Regression Plot')
    plt.show()

In [None]:
# Regression plot from seaborn
# regplot is basically the combination of the scatter plot and the line plot
# (recent seaborn versions require x and y to be passed as keyword arguments)
sns.regplot(x=x, y=y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title("Regression Plot")
plt.show()

**Step 4. Formulate the model and Fit the Model (using library)**

- Split the Data: Divide data into training and testing sets (e.g., 80% training, 20% testing).
- Train the Model: Use a library like sklearn in Python to fit the regression model on the training data.
- Evaluate the Model: Check metrics such as $𝑅^2$ (explained variance) and RMSE (Root Mean Squared Error).

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Create the dataset
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([2, 4, 5, 7, 8, 10, 11, 13, 14, 16])

# Create the linear regression model
model = LinearRegression().fit(X, y)


##### Calculate the Regression Coefficients

Use the formulas for $𝛽_1$ (slope) and $𝛽_0$ (intercept):

$𝛽_1 = \frac{\sum (x_{i}- \bar{x})(y_{i} - \bar{y})}{\sum (x_{i}- \bar{x})^2}$

$𝛽_0 = \bar{y} - 𝛽_1 \bar{x}$

In [None]:
# Mean of x and y from scratch
x = X.ravel()  # flatten the (10, 1) feature matrix from the example above into a 1-D array
x_mean = np.mean(x)
y_mean = np.mean(y)

# Calculating beta1 (slope)
numerator = np.sum((x - x_mean) * (y - y_mean))
denominator = np.sum((x - x_mean) ** 2)
beta1 = numerator / denominator

# Calculating beta0 (intercept)
beta0 = y_mean - beta1 * x_mean

print(f"Beta0 (Intercept): {beta0:.3f}")
print(f"Beta1 (Slope): {beta1:.3f}")

In [None]:
# Get the slope and intercept of the line
slope = model.coef_
intercept = model.intercept_

# Plot the data points and the regression line
plt.scatter(X, y)
plt.plot(X, slope*X + intercept, color='red')
plt.show()

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Example dataset
X = data[['age', 'length_of_stay', 'severity', 'medication_adherence', 'comorbidities']]
y = data['readmissions']

# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

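# Evaluation
# RMSE is the square root of the MSE (the `squared=False` argument of
# mean_squared_error is no longer available in recent scikit-learn releases)
rmse = mean_squared_error(y_test, y_pred) ** 0.5
r2 = r2_score(y_test, y_pred)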

print(f"RMSE: {rmse}, R^2: {r2}")


**Let's check the calculated fit of the line** by measuring how far the true y-values of each point are from their corresponding y-value on the line.   
   
We'll use the equation below to calculate the error of each generated value of y:   
   
$$e_i = y_i - \hat{y}_i$$   

In [None]:
errors = np.array(df.Y - y_gen)
np.round(errors, 2)

In addition to having some very large errors, we can also see that most of the errors are positive numbers. Ideally, we want our errors to be evenly distributed either side of zero - we want our line to best fit the data, i.e. no bias.
   
We can measure the overall error of the fit by calculating the **Residual Sum of Squares**:
   
$$RSS = \sum_{i=1}^n(y_i-\hat{y}_i)^2$$

##### Residual Sum of Squares (RSS)
Definition: The Residual Sum of Squares (RSS) measures the discrepancy between the actual data points and the estimated values predicted by a regression model. It is calculated as the sum of the squared differences between actual ($𝑦_𝑖$) and predicted ($\hat{y}_𝑖 $) values.

The RSS finds the difference between the y-value of each data point and our estimated line (which may be either negative or positive), squares the difference, and then adds all the differences up. In other words, it's the sum of the squares of all the errors we calculated before.

Here:

- $𝑦_𝑖$ = Actual value of the dependent variable for observation 𝑖.
- $\hat{y}_𝑖 = 𝛽_0 + 𝛽_1 𝑥_𝑖$ , where:
    - $𝛽_0$ is the intercept.
    - $𝛽_1$ is the slope of the regression line.
    - $𝑥_𝑖$ is the value of the independent variable for observation 𝑖.

Substituting $\hat{y}_𝑖$:

$$RSS = \sum_{i=1}^n(y_i-(𝛽_0 + 𝛽_1 𝑥_𝑖))^2$$

The RSS quantifies the "unexplained variance" by the model.

In a simple linear regression, minimizing RSS is equivalent to finding the best-fit line.

In [None]:
# Residual sum of squares from scratch (on the test set, to match y_pred above)
rss = np.sum((y_test - y_pred) ** 2)
print(f"Residual Sum of Squares (RSS): {rss:.3f}")

In [None]:
print("Residual sum of squares:", (errors ** 2).sum())

## Least Squares Method
Least Squares is another method that allows us to find the line of best fit while enforcing the constraint of minimising the residuals. More specifically, the **Least Squares Criterion** states that the sum of the squares of the residuals should be minimized, i.e.   
$$Q = \sum_{i=1}^n(y_i-\hat{y}_i)^2$$

The formulae for the intercept, $a$, and the slope, $b$, are determined by minimizing the equation for the sum of the squared prediction errors:   
$$Q = \sum_{i=1}^n(y_i-(a+bx_i))^2$$

Optimal values for $a$ and $b$ are found by differentiating $Q$ with respect to $a$ and $b$, setting both equal to 0 and then solving for $a$ and $b$.   
   
We won't go into the [derivation process](http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf) here, but the equations for $a$ and $b$ are:   
   
$$b = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}$$   
   
and:   
   
$$a = \bar{y} - b\bar{x}$$

where:
- $x_i$ are the values of the independent variable.
- $y_i$ are the values of the dependent variable.
- $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$ in our dataset, respectively.
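
As a quick sanity check, the closed-form expressions for $a$ and $b$ can be compared against NumPy's own least-squares fit. This is a sketch on a small toy dataset (an assumption for illustration). `np.polyfit` with `deg=1` returns the slope first and the intercept second.

```python
import numpy as np

# Small toy dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])

# Closed-form least-squares estimates
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Reference fit from NumPy (highest-degree coefficient first)
slope_np, intercept_np = np.polyfit(x, y, deg=1)

print(f"Closed form: a = {a:.3f}, b = {b:.3f}")
print(f"np.polyfit:  a = {intercept_np:.3f}, b = {slope_np:.3f}")
```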

### Interpreting least-squares coefficients

Interpreting the least-squares coefficients provides insights into the relationship between the independent variable (x) and the dependent variable (y) in a simple linear regression.

#### The Slope ($𝛽_1$)

Interpretation:

- If $𝛽_1 > 0$: y increases as x increases (positive relationship).
- If $𝛽_1 < 0$: y decreases as x increases (negative relationship).
- If $𝛽_1 = 0$: No linear relationship exists between x and y.

If $𝛽_1 = 0.28$ this means that for every one-unit increase in x, y is expected to increase by 0.28 units.

Key Considerations:

- The magnitude of $𝛽_1$ indicates the size of the effect, measured in units of y per unit of x (so it depends on how x and y are scaled).
- The direction (+/-) indicates the nature of the relationship.

#### The Intercept ($𝛽_0$)

Definition:
The intercept ($𝛽_0$) represents the predicted value of the dependent variable (y) when the independent variable (x) is zero.

Interpretation:
- The intercept gives a baseline value of y when x=0.
- It is meaningful only if 𝑥=0 is within the range of observed data. 
    - If not, the intercept might be extrapolated and have limited interpretive value.

Limitations
- Causation vs. Correlation: The coefficients indicate relationships, not causation, unless you have a well-controlled experimental design.
- Range of x: The interpretation of $𝛽_0$ and $𝛽_1$ applies only within the range of observed x-values.
- Other Factors: The model assumes that other variables do not influence 𝑦, which might not be the case in real-world scenarios.

In [None]:
X = df.X.values
Y = df.Y.values

# Calculate x bar, y bar
x_bar = np.mean(X)
y_bar = np.mean(Y)

# Calculate slope
b = sum( (X-x_bar)*(Y-y_bar) ) / sum( (X-x_bar)**2 )

# Calculate intercept
a = y_bar - b*x_bar

print("Slope = " + str(b))
print("Intercept = " + str(a))

In [None]:
# Use the function we created earlier:
# it generates y-values for given x-values based on parameters a, b
y_gen2 = gen_y(df.X, a, b)

plt.scatter(df.X, df.Y)
plt.plot(df.X, y_gen2, color='red')
plt.show()

In [None]:
errors2 = np.array(df.Y - y_gen2)  # e_i = y_i - y_hat_i, the same convention as before
print(np.round(errors2, 2))

In [None]:
print("Residual sum of squares:", (errors2 ** 2).sum())

Here we can see our RSS has improved from ~867 down to ~321.  
Furthermore, if we calculate the sum of the errors we find that the value is close to 0.

----
Intuitively, this should make sense as it is an indication that the sum of the positive errors is equal to the sum of the negative errors. The line fits in the 'middle' of the data.

In [None]:
# Round off to 11 decimal places
np.round(errors2.sum(),11)

##### Recognise the Standard error of a statistic

The standard error (SE) of a statistic in linear regression quantifies the variability of the estimated coefficients ($𝛽_0$ and $𝛽_1$) and other regression outputs. 
- It measures how much the coefficient estimates are expected to vary from sample to sample due to random noise in the data.

**Standard Error of the Regression Coefficients**

For the slope $𝛽_1$ in simple linear regression, the standard error ($SE_{𝛽_1}$) is calculated as:

$$SE_{𝛽_1} = \sqrt{\frac{\sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2}}$$

Where:

- $\sigma^2$: The variance of the residuals, estimated as the residual mean square:
$$\sigma^2 = \frac{RSS}{n-2}$$
- $n$: The number of observations.
- $\sum_{i=1}^n (x_i - \bar{x})^2$: The total variation in the independent variable $x$.

**Standard Error of the Regression**

The standard error of the regression (also called the residual standard error, $RSE$) measures the typical distance that the observed values fall from the regression line.

$$RSE = \sqrt{\frac{RSS}{n-2}}$$

Where:

- RSS: Residual Sum of Squares.
- $n - 2$: Degrees of freedom for simple linear regression ($n - k - 1$)
    - with $k = 1$ predictor.

##### Role of Standard Errors in Linear Regression

Standard errors are used to evaluate the reliability and precision of your regression model.

1. Coefficient Standard Errors ($SE_{𝛽_0}$ and $SE_{𝛽_1}$)
- These are used to:
    - Quantify Precision: Smaller standard errors indicate more precise estimates of the coefficients.
    - Construct Confidence Intervals: The confidence interval for $𝛽_j$ is:

$$𝛽_j \pm t \cdot SE_{𝛽_j}$$

where $t$ is the critical value from the t-distribution for the desired confidence level.

- Perform Hypothesis Tests: To test if $𝛽_j = 0$, we calculate:

$$t = \frac{𝛽_j}{SE_{𝛽_j}}$$

Compare $t$ to the critical t-value to determine significance.

2. Residual Standard Error (𝑅𝑆𝐸) 
- Indicates the average error in predictions.
- Provides a baseline for assessing the fit of the model (smaller 𝑅𝑆𝐸 implies a better fit).
 
**Interpretation**

For the example data used in the code below:

Residual Standard Error (RSE):
- On average, the observed y-values deviate from the predicted y-values by about 0.095 units.

Standard Error of Slope ($SE_{𝛽_1}$):
- The variability in the estimated slope is 0.030. This is used to assess the precision of $𝛽_1$.

Confidence in Coefficients:
- Smaller standard errors indicate more confidence in the coefficient estimates.
- Standard errors also allow hypothesis testing to determine if a predictor has a statistically significant impact on y.

In [None]:
import numpy as np

# Step 1: Define data
x = np.array([1, 2, 3, 4, 5])  # Independent variable
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])  # Dependent variable

# Step 2: Calculate coefficients
x_mean = np.mean(x)
y_mean = np.mean(y)
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Step 3: Calculate residuals and RSS
y_pred = beta0 + beta1 * x  # Predicted values
residuals = y - y_pred  # Residuals
RSS = np.sum(residuals ** 2)  # Residual Sum of Squares

# Step 4: Calculate standard error of the regression (RSE)
n = len(x)  # Number of observations
RSE = np.sqrt(RSS / (n - 2))  # Residual Standard Error

# Step 5: Calculate standard error of the slope (SE_beta1)
SE_beta1 = RSE / np.sqrt(np.sum((x - x_mean) ** 2))

# Step 6: Print results
print(f"Residual Standard Error (RSE): {RSE:.3f}")
print(f"Standard Error of Slope (SE_beta1): {SE_beta1:.3f}")


### Applications of Standard Errors

The standard errors (SEs) in linear regression are used to assess the precision and reliability of the estimated coefficients and model predictions. They serve as the foundation for key inferential techniques like 
- confidence intervals, 
- hypothesis testing, and 
- evaluating the overall fit of the regression model.

1. Constructing Confidence Intervals

Confidence intervals provide a range of plausible values for the regression coefficients.

$$𝛽_𝑗 \pm t \cdot 𝑆𝐸_{𝛽_j}$$

where:
- $𝛽_𝑗$: Estimated coefficient.
- $𝑆𝐸_{𝛽_j}$: Standard error of the coefficient.
- t: Critical value from the t-distribution based on the desired confidence level and degrees of freedom ($𝑛 − 𝑘 − 1$).

Interpretation: If the confidence interval for a coefficient does not include 0, it indicates that the predictor variable has a statistically significant relationship with the dependent variable at the given confidence level.

**Calculate the 95% confidence interval for a regression coefficient, such as slope($𝛽_1$)**

Use the following formula:

Confidence Interval = $$𝛽_1 \pm t \cdot 𝑆𝐸_{𝛽_1}$$

Steps to Calculate the 95% Confidence Interval

1. Estimate the Slope Coefficient ($𝛽_1$)

$$𝛽_1 = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}$$

2. Calculate the Standard Error of the Slope ($𝛽_1$)

- The standard error of the slope is:

$$𝑆𝐸_{𝛽_1} = \frac{ RSE}{\sqrt{\sum{}(x_i-\bar{x})^2}}$$

- RSE (Residual Standard Error) is:

$$RSE = \sqrt{\frac{RSS}{n-2}}$$
where $RSS = \sum_{i=1}^n (y_i-\hat{y}_i)^2$

3. Find the Critical t-value ($𝑡_{critical}$):

Use the t-distribution with 𝑛 − 2 degrees of freedom to find the critical value for the 95% confidence level ($𝑡_{critical}$).

4. Apply the Confidence Interval Formula:

- Combine the values:
    - Confidence Interval = $$𝛽_1 \pm 𝑡_{critical} \cdot 𝑆𝐸_{𝛽_1}$$

_______

2. Hypothesis Testing

Hypothesis testing in Linear Regression
- Once you have fitted a straight line to the data, you need to ask,
    - “Is this straight line a significant fit for the data?” or
    - “Does the beta coefficient explain the variance in the data plotted?”
- Here comes the idea of hypothesis testing on the beta coefficient:

$H_0 : 𝛽_1 = 0$

$H_A : 𝛽_1 \neq 0$

Interpret the Regression Equation
- The coefficients ($𝛽$) indicate the magnitude and direction of the relationship between each predictor and readmissions.
    - Example: A coefficient of -0.5 for medication_adherence means that for every 1% increase in medication adherence, readmissions decrease by 0.5.
- The intercept ($𝛽_0$) represents the expected number of readmissions when all predictors are zero.

Assessing the Model Fit
- Other parameters to assess a model are:
    - t statistic: It is used to determine the p-value and hence, helps in determining whether the coefficient is significant or not
    - F statistic: It is used to assess whether the overall model fit is significant or not. 
        - the higher the value of the F-statistic, the more significant a model turns out to be.
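
The F-statistic compares the variation explained by the model with the variation left unexplained: $F = \frac{(TSS - RSS)/k}{RSS/(n-k-1)}$. The block below is a minimal sketch for one predictor on a small toy dataset. The variable names are illustrative assumptions, and `scipy.stats.f.sf` gives the upper-tail probability of the F-distribution.

```python
import numpy as np
from scipy import stats

# Small toy dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])
n, k = len(x), 1  # observations, predictors

# Fit the line and compute the two sums of squares
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x
rss = np.sum((y - y_hat) ** 2)     # unexplained variation
tss = np.sum((y - y.mean()) ** 2)  # total variation

# F = (explained variation per predictor) / (unexplained variation per degree of freedom)
f_stat = ((tss - rss) / k) / (rss / (n - k - 1))
p_value = stats.f.sf(f_stat, k, n - k - 1)

print(f"F-statistic: {f_stat:.3f}")
print(f"p-value: {p_value:.5f}")
```

With a single predictor the F-statistic equals the square of the slope's t-statistic, so the two tests lead to the same conclusion.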

To determine whether a predictor variable has a significant impact on the dependent variable, use hypothesis testing.
- Null Hypothesis ($H_0$): $𝛽_j = 0$ (the predictor has no effect on the response variable $y$).
- Alternative Hypothesis ($H_a$): $𝛽_j \neq 0$ (the predictor has an effect, i.e. there is a relationship).
- t-statistic: the test statistic used to decide between the two hypotheses, calculated as described next.

**How to Calculate the t-statistic in Linear Regression**

The t-statistic in linear regression measures how many standard errors the estimated coefficient is away from zero. 
- It is used for hypothesis testing to determine if a predictor variable is statistically significant.

The formula to calculate the t-statistic for a coefficient

$$t = \frac{𝛽_𝑗}{SE_{𝛽_𝑗}}$$

Where:
- $𝛽_𝑗$: Estimated coefficient (e.g., slope or intercept).
- $SE_{𝛽_𝑗}$: Standard error of the estimated coefficient.

If the t-statistic is large in magnitude, it indicates that $𝛽_j$ (or $𝛽_1$ in this case) is far from zero relative to its standard error, suggesting the predictor has a significant effect on the dependent variable.

- P-Value: Compare the computed t-value to the critical value from the t-distribution, or calculate the p-value:
    - If $p < \alpha$ (e.g., $\alpha = 0.05$), reject $H_0$ and conclude the predictor is statistically significant.
_________

3. Evaluating Model Fit

The standard error of the regression (Residual Standard Error, $RSE$) assesses the accuracy of the model's predictions.

$$RSE = \sqrt{\frac{RSS}{n-k-1}}$$

where:
- RSS: Residual sum of squares.
- n: Number of observations.
- k: Number of predictors (excluding the intercept).

**Degrees of Freedom**

The t-statistic follows a t-distribution with $n - k - 1$ degrees of freedom.

Interpretation:

- A smaller 𝑅𝑆𝐸 indicates better model fit.
- Used as a baseline to evaluate other models.

________

4. Comparing Predictors

Standard errors help compare the relative importance of different predictors by normalizing their coefficient estimates.
- Predictors with smaller $SE_{𝛽_j}$ have more stable effects on $y$.
- Variables with larger $SE_{𝛽_j}$ might need further investigation (e.g., multicollinearity).
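
With several predictors, the standard errors are usually read off the diagonal of $\hat{\sigma}^2 (X^T X)^{-1}$, where $X$ includes a column of ones for the intercept. The block below is a sketch of that calculation with NumPy. The function name `ols_standard_errors` is an assumption, and the toy data has a single predictor so the result can be compared with the simple-regression formula.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Return (coefficients, standard errors), intercept first, for an OLS fit."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])           # prepend the intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    sigma2 = np.sum(residuals ** 2) / (n - k - 1)       # RSS / degrees of freedom
    covariance = sigma2 * np.linalg.inv(design.T @ design)
    return beta, np.sqrt(np.diag(covariance))

# Small toy dataset with one predictor
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])

beta, se = ols_standard_errors(x, y)
for name, coef, err in zip(["intercept", "slope"], beta, se):
    print(f"{name}: {coef:.3f} (SE {err:.3f})")
```

Passing a matrix with several columns gives one standard error per predictor, which makes it easy to spot coefficients whose estimates are unusually uncertain.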

In [None]:
############## Applications of Standard Errors

import numpy as np
import scipy.stats as stats

# Step 1: Define data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])

# Step 2: Calculate coefficients
x_mean = np.mean(x)
y_mean = np.mean(y)
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Step 3: Calculate residuals and RSS
y_pred = beta0 + beta1 * x
residuals = y - y_pred
RSS = np.sum(residuals ** 2)

# Step 4: Calculate RSE
n = len(x)
RSE = np.sqrt(RSS / (n - 2))

# Step 5: Calculate standard error of the slope (SE_beta1)
SE_beta1 = RSE / np.sqrt(np.sum((x - x_mean) ** 2))

# Step 6: Hypothesis Testing and Confidence Interval
t_stat = beta1 / SE_beta1  # t-statistic
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n-2))  # Two-tailed test

# Confidence Interval for beta1
t_critical = stats.t.ppf(0.975, df=n-2)  # 95% confidence level
conf_interval = (beta1 - t_critical * SE_beta1, beta1 + t_critical * SE_beta1)

# Step 7: Print results
print(f"Coefficient (beta1): {beta1:.3f}")
print(f"Standard Error (SE_beta1): {SE_beta1:.3f}")
print(f"t-Statistic: {t_stat:.3f}")
print(f"p-Value: {p_value:.5f}")
print(f"95% Confidence Interval for beta1: {conf_interval}")


Interpretation
- Coefficient ($𝛽_1$): The slope is 0.75, indicating that y increases by 0.75 units for every one-unit increase in $x$.
- Standard Error ($SE_{𝛽_1}$): The slope estimate has a standard error of 0.030, which is small relative to the slope and indicates a precise estimate.
- t-Statistic and p-Value: The large t-statistic (about 25) and small p-value (about 0.00014) indicate that $𝛽_1$ is statistically significant.
- Confidence Interval: We are 95% confident that the true value of $𝛽_1$ lies between 0.655 and 0.845.

Summary

- Confidence Intervals: Quantify the uncertainty around coefficient estimates.
- Hypothesis Testing: Assess the statistical significance of predictors.
- Model Diagnostics: Evaluate and compare models using 𝑅𝑆𝐸.
- Decision-Making: Use SEs to identify reliable predictors and improve the model.

In [None]:
# Calculate the 95% confidence interval for a regression coefficient, such as beta1 (slope)

import numpy as np
from scipy.stats import t

# Step 1: Define data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])

# Step 2: Calculate the slope (beta1) and intercept (beta0)
x_mean = np.mean(x)
y_mean = np.mean(y)
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Step 3: Calculate residuals, RSS, and RSE
y_pred = beta0 + beta1 * x
residuals = y - y_pred
RSS = np.sum(residuals ** 2)
n = len(x)
RSE = np.sqrt(RSS / (n - 2))

# Step 4: Calculate standard error of the slope (SE_beta1)
SE_beta1 = RSE / np.sqrt(np.sum((x - x_mean) ** 2))

# Step 5: Determine t-critical value for 95% confidence interval
alpha = 0.05  # significance level for a 95% confidence interval
dof = n - 2  # Degrees of freedom (named dof so it does not overwrite the DataFrame df)
t_critical = t.ppf(1 - alpha/2, dof)

# Step 6: Calculate confidence interval
lower_bound = beta1 - t_critical * SE_beta1
upper_bound = beta1 + t_critical * SE_beta1

# Step 7: Print results
print(f"Slope (beta1): {beta1:.3f}")
print(f"Standard Error (SE_beta1): {SE_beta1:.3f}")
print(f"t-Critical: {t_critical:.3f}")
print(f"95% Confidence Interval for beta1: ({lower_bound:.3f}, {upper_bound:.3f})")


Interpretation
- The 95% confidence interval for $𝛽_1$ is (0.655, 0.845).
- This means we are 95% confident that the true slope ($𝛽_1$) lies within this range.
- Since the interval does not include 0, it indicates that the relationship between x and 𝑦 is statistically significant at the 5% significance level.



In [None]:
# Calculate the t-statistic for a simple linear regression with one predictor (x) and one response variable (y)

import numpy as np
from scipy.stats import t

# Step 1: Define data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])

# Step 2: Calculate coefficients
x_mean = np.mean(x)
y_mean = np.mean(y)
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Step 3: Calculate residuals, RSS, and RSE
y_pred = beta0 + beta1 * x
residuals = y - y_pred
RSS = np.sum(residuals ** 2)
n = len(x)
RSE = np.sqrt(RSS / (n - 2))

# Step 4: Calculate standard error of the slope (SE_beta1)
SE_beta1 = RSE / np.sqrt(np.sum((x - x_mean) ** 2))

# Step 5: Calculate the t-statistic
t_statistic = beta1 / SE_beta1

# Step 6: Calculate p-value (two-tailed test)
dof = n - 2  # Degrees of freedom (named dof so it does not overwrite the DataFrame df)
p_value = 2 * (1 - t.cdf(abs(t_statistic), dof))

# Step 7: Print results
print(f"Slope (beta1): {beta1:.3f}")
print(f"Standard Error (SE_beta1): {SE_beta1:.3f}")
print(f"t-Statistic: {t_statistic:.3f}")
print(f"p-Value: {p_value:.5f}")


##### Interpretation
t-Statistic:
- The t-statistic for the slope is about 25, indicating the estimated coefficient is significantly far from zero.

p-Value:
- The small p-value (about 0.00014) suggests strong evidence against the null hypothesis ($𝛽_1 = 0$).
- The predictor (x) is statistically significant at a 5% significance level.

By comparing the t-statistic to critical t-values or using the p-value, you can conclude whether the predictor is significantly associated with the response variable.

### Explaining the rules for rejecting the null hypothesis using p-values

1. What is a p-value?

The p-value is the probability of observing the data (or something more extreme) if the null hypothesis ($𝐻_0$) is true.
- A low p-value indicates that the observed result is unlikely under the assumption of the null hypothesis.

2. Decision Rule for Rejecting the Null Hypothesis

The decision rule depends on the significance level (𝛼), which is the threshold for rejecting $𝐻_0$.
- Common choices for α are 0.05 (5%) or 0.01 (1%).

If p-value ≤ 𝛼: Reject the null hypothesis ($𝐻_0$).
- The result is statistically significant.
- There is strong evidence against the null hypothesis.

If p-value > α: Fail to reject the null hypothesis ($𝐻_0$).
- The result is not statistically significant.
- There isn’t enough evidence to conclude that the null hypothesis is false.

3. Interpretation Guidelines

Small p-value (≤ 𝛼):
- The observed effect would be unlikely if only chance were at work.

Example: 
- p = 0.03 means that, if $𝐻_0$ were true, there would be only a 3% chance of observing data at least as extreme as yours.

Large p-value (> α):
- The observed effect could plausibly occur by chance.

Example: 
- p = 0.10 means that, if $𝐻_0$ were true, there would be a 10% chance of observing data at least as extreme as yours.

Common Misinterpretations to Avoid

1. The p-value is not the probability that $𝐻_0$ is true.
- It is the probability of observing data at least as extreme as yours, assuming $𝐻_0$ is true.
2. Failing to reject $𝐻_0$ does not mean $𝐻_0$ is true.
- It only means there isn’t enough evidence to conclude otherwise.
3. A small p-value does not indicate a large effect size.
- Statistical significance doesn’t always mean practical significance.

Summary

- The choice of significance level (α) determines the threshold for rejecting $𝐻_0$.
- The p-value provides a way to quantify the strength of evidence against $𝐻_0$.
- Always report both the p-value and 𝛼 for transparency in hypothesis testing.

##### Practical Example
Suppose you are testing whether a new marketing strategy improves sales:

Null Hypothesis ($𝐻_0$): The new marketing strategy has no effect on sales (β=0).

Alternative Hypothesis ($𝐻_a$): The new marketing strategy increases sales (β>0).

If your analysis gives a p-value of 0.03, and you have set α=0.05:
- Since p-value (0.03) < α (0.05), you reject ($𝐻_0$).
- You conclude that the new marketing strategy likely increases sales.
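
The decision in this example can be sketched in a few lines. The helper name `decide` is hypothetical, and the p-value is simply the one quoted above rather than one computed from data.

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value decision rule: reject H0 when p <= alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    return "reject H0" if p_value <= alpha else "fail to reject H0"

# Marketing example from the text: p = 0.03 with alpha = 0.05
print(decide(0.03, alpha=0.05))

# The same p-value under a stricter significance level
print(decide(0.03, alpha=0.01))
```

The second call shows why α should be fixed before looking at the data: the same p-value leads to a different decision under a stricter threshold.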

### Regression cost functions: Regression model evaluation metrics

A **loss function** measures the error for a single training example. It is also sometimes called an error function. 

A **cost function**, on the other hand, is the average loss over the entire training dataset. 

**Steps for Loss Functions**
1. Define the predictor function f(X), and identify the parameters to find.
2. Determine the loss for each training example.
3. Derive the expression for the Cost Function, representing the average loss across all examples.
4. Compute the gradient of the Cost Function with respect to each unknown parameter.
5. Select the learning rate and execute the weight update rule for a fixed number of iterations.

These steps guide the optimization process, aiding in the determination of optimal model parameters.

The following metrics are generally used to evaluate prediction error and model performance in regression analysis.

1. **R-squared (Coefficient of determination)** 
- Indicates the proportion of variance in the dependent variable explained by the independent variables. 
- It measures how closely the fitted values track the observed values. 
- It helps answer the question: "How well does my model explain the variability in the dependent variable?"
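- For a least-squares fit with an intercept, evaluated on the data it was fitted to, the value lies between 0 and 1 and is usually read as a percentage. On held-out data it can be negative when the model predicts worse than the mean of y. 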
    - where: 
        - $𝑅^2$ = 1 (close to 1): Perfect fit (all variability in y is explained by X).
            - The model explains a large proportion of the variability in the data.
            - A large proportion of the variance in the dependent variable is explained by the independent variables.
            - Example: If $𝑅^2$ = 0.85
                - Then 85% of the variability in y is explained by X. The remaining 15% is due to unexplained variability (e.g., noise, unobserved variables).
        - $𝑅^2$ = 0 (close to 0): No relationship (the model does not explain any variability in y).
            - The model fails to explain much of the variability; only a small proportion of the variance is explained.
            - Example: If $𝑅^2$ = 0.1
                - only 10% of the variability is explained by the model. This suggests either:
                    - The model lacks important predictors.
                    - The relationship between X and y may not be linear.
                    - There is high variability in y that cannot be captured effectively.
    - The higher the value, the better the model fits the data. A high value does not necessarily mean the model predicts accurately, and it does not imply causation or that the model is the best predictor.

R-squared statistic is calculated as:

$$𝑅^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Where:
- $SS_{res}$ (Residual Sum of Squares): Sum of squared differences between actual and predicted values.

$$SS_{res} = \sum^{n}_{i = 1} (y_i - \hat{y_i})^2$$

- $SS_{tot}$ (Total Sum of Squares): Sum of squared differences between actual values and their mean.

$$SS_{tot} = \sum^{n}_{i = 1} (y_i - \bar{y})^2$$

### Important Caveats
1. Overfitting in Complex Models
- High $𝑅^2$ may result from overfitting, especially in models with many predictors.
2. Does Not Imply Causation
- High $𝑅^2$ shows correlation, not causation. For example, a model may explain variability due to spurious relationships.
3. Limited Applicability to Prediction
- High $𝑅^2$ doesn’t guarantee that the model predicts new data well (check with metrics like RMSE on test data).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

# Step 1: Create synthetic data
np.random.seed(42)
X = np.random.rand(100, 1) * 10  # Independent variable
y = 3 * X.squeeze() + 7 + np.random.randn(100) * 3  # Dependent variable with noise

# Step 2: Fit a simple linear regression model
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Step 3: Calculate R-squared manually
y_mean = np.mean(y)
SS_res = np.sum((y - y_pred) ** 2)  # Residual Sum of Squares
SS_tot = np.sum((y - y_mean) ** 2)  # Total Sum of Squares
R_squared = 1 - (SS_res / SS_tot)

print(f"Manual R-squared: {R_squared:.4f}")

# Step 4: Calculate R-squared using sklearn
R_squared_sklearn = r2_score(y, y_pred)
print(f"Sklearn R-squared: {R_squared_sklearn:.4f}")

# Plotting
plt.scatter(X, y, label="Actual Data", alpha=0.7)
plt.plot(X, y_pred, color="red", label="Fitted Line")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.title(f"Linear Regression (R-squared: {R_squared:.4f})")
plt.show()

2. **Adjusted R-squared**
- Adjusted version of $𝑅^2$ that accounts for the number of predictors in the model.

$$𝑅^2_{adj} = 1 - \frac{(1 - 𝑅^2)(n -1)}{n - p - 1}$$

where:
- n is the number of observations, and
- p is the number of predictors.

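As a rough guide to reading $R^2$ values:
- $R^2$ = 1: the fitted line passes through every data point.
- $R^2$ = 0.5: the model explains half of the variance; substantial error remains.
- $R^2$ = 0.05: the model explains very little of the variance.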

Usage: Useful when comparing models with a different number of predictors
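
The adjustment formula can be sketched directly. The function name `adjusted_r2` and the input values are illustrative assumptions, not results from a fitted model.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more than p + 1 observations")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same raw R-squared, increasing number of predictors (hypothetical values)
for p in (1, 5, 20):
    print(p, round(adjusted_r2(0.85, n=50, p=p), 4))
```

With the raw $R^2$ held fixed, the adjusted value falls as predictors are added, which is why it is the fairer number for comparing models of different sizes.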

_____

3. **Mean Error (ME)**
- The error for each training example is calculated, and then the mean of all these errors is taken.
- Errors can be both negative and positive, so they can cancel each other out during summation and give a mean error near zero even for a poor model.
- It is not a recommended cost function, but it lays the foundation for the other cost functions of regression models.
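
A small sketch of the cancellation problem, using made-up actual and predicted values:

```python
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]  # hypothetical predictions, each one is wrong

# Signed errors: positive and negative values offset each other
errors = [a - p for a, p in zip(actual, predicted)]
mean_error = sum(errors) / len(errors)

# Taking absolute values first prevents the cancellation
mean_abs_error = sum(abs(e) for e in errors) / len(errors)

print(f"Errors: {errors}")
print(f"Mean Error: {mean_error:.2f}")
print(f"Mean Absolute Error: {mean_abs_error:.2f}")
```

Every prediction misses, yet the mean error is zero, which is the reason the metrics that follow square the errors or take their absolute value.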

Residual Analysis

- Residuals: Differences between actual and predicted values ($𝑦_𝑖 − \hat{𝑦_𝑖}$).
- Analysis: Residual plots help diagnose issues like non-linearity, heteroscedasticity, and independence of errors.

_____

4. **MSE (Mean Squared Error)**
- Also known as L2 loss.
- It is the average of the squared differences between the actual and predicted values over the dataset.
- Squaring each difference keeps positive and negative errors from cancelling out, which is the drawback of the Mean Error.
$$MSE = \frac{1}{n} \sum^{n}_{i = 1} (y_i - \hat{y}_i)^2$$
- Because each error is squared, large deviations are penalized far more heavily than under MAE, while errors smaller than 1 are shrunk. 
    - If the dataset has outliers that produce large prediction errors, squaring magnifies them and they dominate the MSE.
    - MSE is therefore less robust to outliers than MAE and is a poor choice when the data is prone to many outliers.

Usage: Penalizes larger errors more heavily than smaller ones.

Graphically
- For a linear model, the MSE cost is a positive quadratic function of each parameter (of the form $ax^2 + bx + c$ where $a > 0$).
- Such a function is convex: it has a global minimum and no other local minima, so Gradient Descent cannot get stuck in a local minimum.
- Hence, if Gradient Descent converges (which requires a suitably small learning rate), it converges to the global minimum.

In [None]:
def update_weights_MSE(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate partial derivatives
        # -2x(y - (mx + b))
        m_deriv += -2*X[i] * (Y[i] - (m*X[i] + b))

        # -2(y - (mx + b))
        b_deriv += -2*(Y[i] - (m*X[i] + b))

    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b

**Advantages of R-squared**:

- Interpretability: $𝑅^2$ is unitless and ranges between 0 and 1, making it easy to understand and compare across datasets.
- Proportional Explanation: Quantifies the proportion of variance explained by the model, offering insights into model effectiveness.
- Model Comparison: Useful for comparing the explanatory power of different models or regression equations.

**Disadvantages of RSE Compared to R-squared**:

- RSE depends on the scale of the dependent variable, making it hard to compare across datasets with different units.
- RSE alone does not provide information on how much variance the model explains.

_______

5. **RMSE (Root Mean Squared Error)** 
- The square root of the MSE.

$$RMSE = \sqrt{MSE}$$

Usage: Commonly used because it is in the same units as the dependent variable and emphasizes larger errors.
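
As a minimal illustration of the units point, the sketch below computes both metrics on made-up values; the helper names `mse` and `rmse` are not from a library.

```python
import math

def mse(actual, predicted):
    """Mean of the squared differences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mse(actual, predicted))

actual = [3.0, 5.0, 7.0, 9.0]        # hypothetical targets, e.g. in metres
predicted = [2.0, 5.0, 8.0, 11.0]

print(f"MSE:  {mse(actual, predicted):.4f} (squared units)")
print(f"RMSE: {rmse(actual, predicted):.4f} (same units as y)")
```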

______

6. **MAE (Mean absolute error)**
- Also known as L1 loss.
- It is the average of the absolute differences between the predicted and actual values over the dataset.
- The absolute error for each training example is the distance between the predicted and the actual value, irrespective of sign, so positive and negative errors cannot cancel.
- Because errors are not squared, MAE is more robust to outliers than MSE and gives more stable results when the dataset contains noise or outliers.

$$MAE = \frac{1}{n} \sum^{n}_{i = 1} |y_i - \hat{y_i}|$$

Usage: Provides an easily interpretable measure of error in the same units as the dependent variable.
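
The robustness claim can be checked with a short sketch. The data is invented, and the only difference between the two prediction sets is one badly wrong value.

```python
def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [10.0, 12.0, 14.0, 16.0, 18.0]
clean = [11.0, 11.0, 15.0, 15.0, 19.0]     # every prediction is off by 1
outlier = [11.0, 11.0, 15.0, 15.0, 38.0]   # last prediction is off by 20

for name, preds in (("clean", clean), ("with outlier", outlier)):
    print(f"{name:>12}: MAE = {mae(actual, preds):.2f}, MSE = {mse(actual, preds):.2f}")
```

The single outlier raises the MAE in proportion to its size, while the MSE grows with the square of the error and ends up dominated by that one point.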

In [None]:
def update_weights_MAE(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        residual = Y[i] - (m*X[i] + b)
        if residual == 0:
            # |residual| is not differentiable at 0; use the subgradient 0 for this example
            continue

        # Calculate partial derivatives
        # -x(y - (mx + b)) / |y - (mx + b)|
        m_deriv += -X[i] * residual / abs(residual)

        # -(y - (mx + b)) / |y - (mx + b)|
        b_deriv += -residual / abs(residual)

    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b

7. **Mean Absolute Percentage Error (MAPE)**

Definition: The mean of the absolute percentage differences between predicted and actual values.

$$MAPE = \frac{1}{n} \sum^{n}_{i = 1} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100$$

Usage: Expresses error as a percentage, making it easier to interpret across datasets. It is undefined when any actual value $y_i$ is zero.
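
A minimal sketch of the calculation; the function name `mape` and the explicit check for zero targets are choices made for this example.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return total / len(actual) * 100

# Hypothetical values: both predictions are off by 10% of the actual value
print(f"MAPE: {mape([100.0, 200.0], [110.0, 180.0]):.1f}%")
```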

___________

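8. **Huber Loss**

- The Huber loss combines the best properties of MSE and MAE.
- It is quadratic for small errors and linear otherwise (and similarly for its gradient). 
- It is controlled by its delta parameter ($\delta$), the error size at which it switches from quadratic to linear. In its standard form:

$$L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\ \delta \left( |y - \hat{y}| - \frac{1}{2}\delta \right) & \text{otherwise} \end{cases}$$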
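
A minimal sketch of the loss for a single residual, assuming the common convention with a factor of 1/2 in the quadratic zone (the convention implied by the derivatives used in this section's weight-update function). The name `huber_loss` is illustrative.

```python
def huber_loss(residual, delta=1.0):
    """Huber loss for one residual r = y - y_hat."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r ** 2            # quadratic zone, behaves like MSE
    return delta * (r - 0.5 * delta)   # linear zone, behaves like MAE

# The two branches agree at |r| = delta, so the loss has no jump there
for r in (0.5, 1.0, 3.0, 10.0):
    print(f"residual = {r:5.1f}  loss = {huber_loss(r, delta=1.0):.3f}")
```

A larger delta makes the loss behave more like MSE; a smaller delta makes it behave more like MAE.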

In [None]:
def update_weights_Huber(m, b, X, Y, delta, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # derivative of quadratic for small values and of linear for large values
        if abs(Y[i] - m*X[i] - b) <= delta:
            m_deriv += -X[i] * (Y[i] - (m*X[i] + b))
            b_deriv += -(Y[i] - (m*X[i] + b))
        else:
            m_deriv += delta * X[i] * ((m*X[i] + b) - Y[i]) / abs((m*X[i] + b) - Y[i])
            b_deriv += delta * ((m*X[i] + b) - Y[i]) / abs((m*X[i] + b) - Y[i])
    
    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b

##### Choosing the Right Measure
- Use R-squared or Adjusted R-squared to evaluate the proportion of variance explained.
- Use MAE or RMSE for measuring prediction error in the units of the target variable (MSE is in squared units).
- Use MAPE for interpreting errors as percentages.

**Step 5: Interpret the Results**

Residual Analysis:
- Check that the residuals are approximately normally distributed.
- Homoscedasticity describes a situation in which the variance of the error term is the same across all values of the independent variables. 
    - This means the spread of the residuals is roughly constant along the regression line.

Interpretation of Regression Output
- R-Squared : is a statistical measure of fit that indicates how much variation of a dependent variable is explained by the independent variables. 
    - Higher R-Squared value represents smaller differences between the observed data and fitted values.

**Optimization technique/Strategy**

We will use Gradient Descent as an optimization strategy to find the regression line.
- Weight Update Rule

NB: Perform optimization on the training data and check performance on separate validation data.

**Gradient Descent for Linear Regression**

What is gradient descent?
- In layman's terms: 
    - It is like checking the ground around you to see where the land descends.
    - It tells you in which direction to take your next step.
    - It helps models find the optimal set of parameters by iteratively adjusting them in the opposite direction of the gradient.

Mathematical terms:
- Find the best parameters ($θ_0$) and ($θ_1$) for our learning algorithm.

Cost space is how our algorithm would perform when we choose a particular value for a parameter.

Cost Function is a function that measures the performance of a model for any given data. Cost Function quantifies the error between predicted values and expected values and presents it in the form of a single real number.

1. Make a hypothesis with initial parameters
- Hypothesis: $h_θ(x) = θ_0 + θ_1 x$
- Parameters: $θ_0, θ_1$
2. Calculate the Cost function
- Cost Function: $J(θ_0, θ_1) = \frac{1}{2m}\sum^{m}_{i = 1} (h_θ (x^{(i)}) - y^{(i)})^2$
3. The goal is to reduce the cost function. To do so, we modify the parameters using the Gradient descent algorithm over the given data.
- Goal: $\min_{θ_0, θ_1} J(θ_0, θ_1)$
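
The cost function in step 2 can be evaluated directly. This is a sketch on invented data, and the function name `cost` is an assumption of the example.

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2), with h(x) = theta0 + theta1 * x."""
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]   # hypothetical data generated by y = 1 + 2x

print(cost(0.0, 0.0, xs, ys))   # cost at a poor initial guess
print(cost(1.0, 2.0, xs, ys))   # cost at the parameters that generated the data
```

Gradient descent searches for the parameter pair that drives this number as low as possible.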

**Gradient descent**

- one of the optimization algorithms that optimize the cost function (objective function) to reach the optimal minimal solution.
- aims to find the parameters that minimize this discrepancy and improve the model’s performance.
    - Need to reduce the cost function (MSE) for all data points. 
    - This is done by updating the values of the slope coefficient and the constant coefficient iteratively until we get an optimal solution for the linear function.

The algorithm operates by calculating the gradient of the cost function, 
- which indicates the direction and magnitude of the steepest ascent. 

However, since the goal is to minimize the cost function, gradient descent moves in the opposite direction of the gradient, 
- known as the negative gradient direction.

Iteratively updating the model’s parameters in the negative gradient direction, gradient descent gradually converges towards the optimal set of parameters that yields the lowest cost.

- Hyperparameter: learning rate, determines the step size taken in each iteration, influencing the speed and stability of convergence.

Gradient descent can be applied to:
- linear regression, 
- logistic regression, 
- neural networks, and 
- support vector machines.

**Definition**: Gradient descent is an iterative optimization algorithm for finding the local minimum of a function.

To find the local minimum of a function using gradient descent, we take steps proportional to the negative of the gradient of the function at the current point (that is, we move against the gradient).
- If we take steps proportional to the positive of the gradient (moving towards the gradient), we will approach a local maximum of the function, and the procedure is called Gradient Ascent.

The goal of the gradient descent algorithm is to minimize the given function (say, cost function)
- it performs two steps iteratively:
1. Compute the gradient (slope), the first-order derivative of the function at that point
2. Take a step in the direction opposite to the gradient: move from the current point by alpha times the gradient at that point.
- The step size alpha is the learning rate, and it decides how fast the algorithm converges to the minimum.
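
The two steps can be sketched on a one-dimensional function. The target function $f(x) = (x - 3)^2$, the helper name `minimize`, and the default settings are assumptions chosen for illustration.

```python
def minimize(grad, x0, learning_rate=0.1, num_iters=100):
    """Repeat: compute the gradient, then step against it."""
    x = x0
    for _ in range(num_iters):
        g = grad(x)                  # step 1: gradient at the current point
        x = x - learning_rate * g    # step 2: move opposite to the gradient
    return x

# f(x) = (x - 3)^2 has derivative f'(x) = 2 * (x - 3) and its minimum at x = 3
x_min = minimize(lambda x: 2 * (x - 3), x0=0.0)
print(f"Estimated minimum at x = {x_min:.6f}")
```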

This code creates a function called gradient_descent, which requires the training data, learning rate, and number of iterations as parameters.

Steps:
1. Initializes the weights to arbitrary values and the bias to zero.
2. Loops for a fixed number of iterations.
3. Computes the predicted y values using the current weights and bias.
4. Calculates the error between the actual and predicted y values.
5. Computes the gradients of the cost function with respect to the weights and bias.
6. Adjusts the weights and bias using the gradients and the learning rate.
7. Returns the learned weights and bias.


In [None]:
import numpy as np

def gradient_descent(X, y, learning_rate, num_iters):
  """
  Performs gradient descent to find optimal weights and bias for linear regression.

  Args:
      X: A numpy array of shape (m, n) representing the training data features.
      y: A numpy array of shape (m,) representing the training data target values.
      learning_rate: The learning rate to control the step size during updates.
      num_iters: The number of iterations to perform gradient descent.

  Returns:
      A tuple containing the learned weights and bias.
  """

  # Initialize weights with random values and bias with zero
  m, n = X.shape
  weights = np.random.rand(n)
  bias = 0

  # Loop for the number of iterations
  for i in range(num_iters):
    # Predict y values using current weights and bias
    y_predicted = np.dot(X, weights) + bias

    # Calculate the error
    error = y - y_predicted

    # Calculate gradients for weights and bias
    weights_gradient = -2/m * np.dot(X.T, error)
    bias_gradient = -2/m * np.sum(error)

    # Update weights and bias using learning rate
    weights -= learning_rate * weights_gradient
    bias -= learning_rate * bias_gradient

  return weights, bias

# Example usage
X = np.array([[1, 1], [2, 2], [3, 3]])
y = np.array([2, 4, 5])
learning_rate = 0.01
num_iters = 100

weights, bias = gradient_descent(X, y, learning_rate, num_iters)

print("Learned weights:", weights)
print("Learned bias:", bias)

How Does Gradient Descent Work?
1. The algorithm optimizes to minimize the model’s cost function.
2. The cost function measures how well the model fits the training data and defines the difference between the predicted and actual values.
3. The cost function’s gradient is the derivative with respect to the model’s parameters and points in the direction of the steepest ascent.
4. The algorithm starts with an initial set of parameters and updates them in small steps to minimize the cost function.
5. In each iteration of the algorithm, it computes the gradient of the cost function with respect to each parameter.
6. The gradient tells us the direction of the steepest ascent, and by moving in the opposite direction, we can find the direction of the steepest descent.
7. The learning rate controls the step size, which determines how quickly the algorithm moves towards the minimum.
8. The process is repeated until the cost function converges to a minimum, indicating that the model has reached a locally optimal set of parameters.
9. Different variations of gradient descent include batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, each with advantages and limitations.
10. Efficient implementation of gradient descent is essential for performing well in machine learning tasks. The choice of the learning rate and the number of iterations can significantly impact the algorithm’s performance.
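
The effect of the learning rate can be seen on a toy problem. The function $f(x) = (x - 3)^2$ and the three rates are assumptions picked to show slow convergence, good convergence, and divergence.

```python
def run(learning_rate, num_iters=50, x0=0.0):
    """Gradient descent on f(x) = (x - 3)^2; returns the final x."""
    x = x0
    for _ in range(num_iters):
        x -= learning_rate * 2 * (x - 3)   # gradient of (x - 3)^2 is 2 * (x - 3)
    return x

for lr in (0.01, 0.1, 1.1):
    distance = abs(run(lr) - 3)
    print(f"learning rate {lr:<4}: distance from minimum after 50 steps = {distance:.3g}")
```

A rate that is too small leaves the algorithm far from the minimum after the same number of steps, while a rate that is too large makes every step overshoot by more than the last.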

On the basis of differentiation techniques 
- Gradient descent requires calculating the gradient by differentiating the cost function. Methods can use either first-order or second-order derivatives:
    - First-order differentiation
    - Second-order differentiation

To update $𝛽_0$ and $𝛽_1$, we take gradients from the cost function. To find these gradients, we take partial derivatives with respect to $𝛽_0$ and $𝛽_1$.

$J = \frac{1}{n} \sum^{n}_{i = 1} (𝛽_{0} + 𝛽_{1} x_i - y_i)^2$

$\frac{\partial J}{\partial 𝛽_{0}} = \frac{2}{n} \sum^{n}_{i = 1} (𝛽_{0} + 𝛽_{1} x_i - y_i)$

$\frac{\partial J}{\partial 𝛽_{1}} = \frac{2}{n} \sum^{n}_{i = 1} (𝛽_{0} + 𝛽_{1} x_i - y_i) \cdot x_i$

$𝛽_{0} = 𝛽_{0} - \alpha \cdot \frac{2}{n} \sum^{n}_{i = 1} (\hat{y}_i - y_{i})$

$𝛽_{1} = 𝛽_{1} - \alpha \cdot \frac{2}{n} \sum^{n}_{i = 1} (\hat{y}_i - y_{i}) \cdot x_i$

Where: 
- $\hat{y}_i = 𝛽_0 + 𝛽_1 x_i$ is the prediction for example $i$.
- The partial derivatives are the gradients, and they are used to update the values of $𝛽_0$ and $𝛽_1$. 
- $\alpha$ is the learning rate.
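
One way to gain confidence in these derivatives is to compare them with a numerical approximation. The sketch below does that on invented data and then applies one update step; the names `cost` and `gradients` and all numeric values are assumptions of the example.

```python
def cost(b0, b1, xs, ys):
    """Mean squared error J for the line y = b0 + b1 * x."""
    n = len(xs)
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys)) / n

def gradients(b0, b1, xs, ys):
    """Analytic partial derivatives of J with respect to b0 and b1."""
    n = len(xs)
    residuals = [b0 + b1 * x - y for x, y in zip(xs, ys)]
    d_b0 = 2 / n * sum(residuals)
    d_b1 = 2 / n * sum(r * x for r, x in zip(residuals, xs))
    return d_b0, d_b1

xs = [1.0, 2.0, 3.0, 4.0]      # hypothetical data
ys = [2.0, 4.1, 5.9, 8.2]
b0, b1, alpha, eps = 0.5, 1.5, 0.05, 1e-6

d_b0, d_b1 = gradients(b0, b1, xs, ys)

# Central finite differences approximate the same derivatives numerically
num_b0 = (cost(b0 + eps, b1, xs, ys) - cost(b0 - eps, b1, xs, ys)) / (2 * eps)
num_b1 = (cost(b0, b1 + eps, xs, ys) - cost(b0, b1 - eps, xs, ys)) / (2 * eps)
print(f"analytic: {d_b0:.4f}, {d_b1:.4f}")
print(f"numeric:  {num_b0:.4f}, {num_b1:.4f}")

# One update step with learning rate alpha
new_b0 = b0 - alpha * d_b0
new_b1 = b1 - alpha * d_b1
print(f"cost before: {cost(b0, b1, xs, ys):.4f}, after: {cost(new_b0, new_b1, xs, ys):.4f}")
```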

**Types of Gradient Descent**

Gradient descent algorithms are mainly classified in two ways: by differentiation technique (above) and by data ingestion.
- On the basis of data ingestion: the choice of gradient descent algorithm depends on the problem at hand and the size of the dataset.

**Full Batch Gradient Descent Algorithm**:
- Batch gradient descent is also known as vanilla gradient descent.
- In full batch gradient descent, you use the whole dataset at once to compute the gradient.
    - It updates the model’s parameters using the gradient of the entire training set.
- It calculates the average gradient of the cost function over all the training examples and updates the parameters in the opposite direction.
    - It calculates the error for each example within the training dataset.
    - The model is not changed until every training sample has been assessed. 
        - One such full pass over the training data is referred to as a **cycle** or a **training epoch**.
- For a convex cost function and a suitably small learning rate, batch gradient descent converges to the global minimum, but it can be computationally expensive and slow for large datasets.
    - Batch gradient descent is suitable for small datasets.
    - It produces a stable error gradient and a stable convergence.
- A drawback is that the stable error gradient can sometimes result in a state of convergence that isn’t the best the model can achieve. 
    - It also requires the entire training dataset to be in memory and available to the algorithm.

Advantages
- Fewer model updates mean that this variant of gradient descent is more computationally efficient per epoch than stochastic gradient descent.
- Reducing the update frequency provides a more stable error gradient and a more stable convergence for some problems.
- Separating the prediction error calculations from the model updates lends the algorithm to parallel implementations.

Disadvantages
- A more stable error gradient can cause the model to converge prematurely to a suboptimal set of parameters.
- Updating at the end of each training epoch requires the additional complexity of accumulating prediction errors across all training examples.
- Batch gradient descent typically requires the entire training dataset to be in memory and available to the algorithm.
- Large datasets can result in very slow model updates or training speeds and require more computational power.

#### Variants

##### Vanilla Gradient Descent

Vanilla means pure / without any adulteration.
- It is the simplest form of the gradient descent technique.
    - Its main feature is that we take small steps in the direction of the minimum by following the gradient of the cost function.

Pseudocode Vanilla Gradient Descent

$\text{update} = \text{learning rate} \times \text{gradient of parameters}$

$\text{parameters} = \text{parameters} - \text{update}$

- We update the parameters by taking the gradient of the cost function with respect to the parameters 
- and multiplying it by a learning rate, which is essentially a constant number suggesting how fast we want to go towards the minimum.

**Learning rate** is a hyper-parameter and should be treated with care when choosing its value.

##### Gradient Descent with Momentum

Tweaks the above algorithm so that we take the prior step into account before taking the next step.

Pseudocode Gradient Descent with Momentum

$\text{update} = \text{learning rate} \times \text{gradient}$

$\text{velocity} = \text{previous update} \times \text{momentum}$

$\text{parameter} = \text{parameter} + \text{velocity} - \text{update}$

This introduces a velocity term, which combines the previous update with a constant called momentum.
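
A minimal sketch of the momentum idea on a one-dimensional problem. It assumes that "previous update" means the full step applied in the previous iteration, which is the usual formulation of momentum; the function name, the test function $f(x) = (x - 3)^2$, and the default values are illustrative.

```python
def momentum_descent(grad, x0, learning_rate=0.1, momentum=0.9, num_iters=300):
    x = x0
    step = 0.0                       # step applied in the previous iteration
    for _ in range(num_iters):
        update = learning_rate * grad(x)
        velocity = momentum * step   # carry over part of the previous step
        step = velocity - update     # total move for this iteration
        x = x + step
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3
x_min = momentum_descent(lambda x: 2 * (x - 3), x0=0.0)
print(f"Estimated minimum at x = {x_min:.6f}")
```

With momentum close to 1 the iterate overshoots and oscillates around the minimum before settling, which is the price paid for moving faster along consistent gradient directions.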

##### ADAGRAD

ADAGRAD uses an adaptive technique for updating the learning rate. The learning rate is changed on the basis of how the gradient has been changing over all the previous iterations.

Pseudocode ADAGRAD

$\text{grad component} = \text{previous grad component} + (\text{gradient} \times \text{gradient})$

$\text{rate change} = \sqrt{\text{grad component}} + \epsilon$

$\text{adapted learning rate} = \text{learning rate} / \text{rate change}$

$\text{update} = \text{adapted learning rate} \times \text{gradient}$

$\text{parameter} = \text{parameter} - \text{update}$

where:
- epsilon ($\epsilon$) is a small constant that prevents division by zero when the accumulated gradient is still close to zero.
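
A one-dimensional sketch of the ADAGRAD update, in which the base learning rate is divided by the square root of the accumulated squared gradients. The function name, the test function $f(x) = (x - 3)^2$, and the default values are assumptions for illustration.

```python
import math

def adagrad(grad, x0, learning_rate=1.0, epsilon=1e-8, num_iters=500):
    x = x0
    grad_sq_sum = 0.0                # running sum of squared gradients
    for _ in range(num_iters):
        g = grad(x)
        grad_sq_sum += g * g
        # The effective learning rate shrinks as squared gradients accumulate
        adapted_lr = learning_rate / (math.sqrt(grad_sq_sum) + epsilon)
        x = x - adapted_lr * g
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3
x_min = adagrad(lambda x: 2 * (x - 3), x0=0.0)
print(f"Estimated minimum at x = {x_min:.6f}")
```

Because the accumulated sum never decreases, the effective learning rate can only shrink over time, which is the downside that later methods try to address.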

##### ADAM

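ADAM is one more adaptive technique, which builds on ADAGRAD and further reduces its downsides.
- Consider this as momentum + ADAGRAD.

Pseudocode

$\text{adapted gradient} = \text{previous adapted gradient} + (\text{gradient} - \text{previous adapted gradient}) \times (1 - \beta_1)$

$\text{squared gradient} = \text{previous squared gradient} + (\text{gradient}^2 - \text{previous squared gradient}) \times (1 - \beta_2)$

$\text{adapted learning rate} = \text{learning rate} / (\sqrt{\text{squared gradient}} + \epsilon)$

$\text{update} = \text{adapted learning rate} \times \text{adapted gradient}$

$\text{parameter} = \text{parameter} - \text{update}$

where:
- beta1 ($\beta_1$) and beta2 ($\beta_2$) are constants that control how quickly the running averages of the gradient and of the squared gradient change.
- The full algorithm also applies a bias correction to both running averages, omitted here for brevity.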
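
A one-dimensional sketch of the ADAM update, including the bias-correction step of the standard algorithm. The function name, the test function $f(x) = (x - 3)^2$, and the chosen learning rate are assumptions; the beta and epsilon defaults are the commonly used values.

```python
import math

def adam(grad, x0, learning_rate=0.1, beta1=0.9, beta2=0.999,
         epsilon=1e-8, num_iters=1000):
    x = x0
    m = 0.0   # running average of gradients (the momentum part)
    v = 0.0   # running average of squared gradients (the adaptive part)
    for t in range(1, num_iters + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Both averages start at zero, so correct their bias towards zero
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x = x - learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3
x_min = adam(lambda x: 2 * (x - 3), x0=0.0)
print(f"Estimated minimum at x = {x_min:.4f}")
```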

There are also second-order (quasi-Newton) methods such as **L-BFGS**.

In [None]:
class GDRegressor:
    
    def __init__(self,learning_rate=0.01,epochs=100):
        
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        
    def fit(self,X_train,y_train):
        # init your coefs
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        
        for i in range(self.epochs):
            # update all the coef and the intercept
            y_hat = np.dot(X_train,self.coef_) + self.intercept_
            #print("Shape of y_hat",y_hat.shape)
            intercept_der = -2 * np.mean(y_train - y_hat)
            self.intercept_ = self.intercept_ - (self.lr * intercept_der)
            
            coef_der = -2 * np.dot((y_train - y_hat),X_train)/X_train.shape[0]
            self.coef_ = self.coef_ - (self.lr * coef_der)
        
        print(self.intercept_,self.coef_)
    
    def predict(self,X_test):
        return np.dot(X_test,self.coef_) + self.intercept_

**Stochastic Gradient Descent Algorithm**
- In stochastic gradient descent, you use a single sample at a time to compute the gradient.
    - It randomly selects an example from the training dataset, 
    - computes the gradient of the cost function for that example, 
    - and updates the parameters in the opposite direction.
    - The parameters are therefore updated once for each training example in the dataset.
        - The frequent updates give a fairly detailed picture of the rate of improvement (a benefit).
- The stochastic gradient descent algorithm is more suitable for large datasets.
- It is computationally efficient and can converge faster than batch gradient descent. However, its gradients are noisy, which causes the error rate to fluctuate rather than decrease steadily, and it may not settle exactly at the global minimum.

Advantages
- You can see your model’s performance and rate of improvement immediately, thanks to the frequent updates.
- This variant of gradient descent is probably the easiest to understand and implement, especially for beginners.
- The increased frequency of model updates can result in faster learning on some problems.
- The noisy update process can allow the model to escape local minima (e.g., premature convergence).
- Each update is fast and requires little computation or memory.
- Suitable for larger datasets.

Disadvantages
- Updating the model for every example is more computationally expensive per epoch than other gradient descent configurations, so a full pass over a large dataset can take considerable time.
- Frequent updates can result in a noisy gradient signal, which can cause the model parameters, and in turn the model error, to jump around (higher variance across training epochs).
- A noisy learning process along the error gradient can also make it difficult for the algorithm to settle on the model’s minimum error.
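
The per-example update can be sketched for a simple line fit. This is an illustration on invented, noise-free data, not a replacement for a library implementation; the function name, the fixed seed, and the hyperparameter values are assumptions.

```python
import random

def sgd_linear_regression(xs, ys, learning_rate=0.02, epochs=500, seed=0):
    """Fit y = m * x + b by updating the parameters after every single example."""
    rng = random.Random(seed)
    m, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)              # visit the examples in random order
        for i in indices:
            error = (m * xs[i] + b) - ys[i]
            # Gradient of the squared error for this one example
            m -= learning_rate * 2 * error * xs[i]
            b -= learning_rate * 2 * error
    return m, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]    # hypothetical noise-free data, y = 2x + 1
m, b = sgd_linear_regression(xs, ys)
print(f"slope = {m:.4f}, intercept = {b:.4f}")
```

On noisy data the parameters would keep jittering around the best fit instead of settling exactly, which is the fluctuation described above.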

In [None]:
from sklearn.linear_model import SGDClassifier
X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
clf.fit(X, y)

**Mini-batch Gradient Descent**
- Mini-batch gradient descent is a good compromise between the two and is often used in practice.
- It updates the model’s parameters using the gradient of a small subset of the training dataset, known as a mini-batch. 
- It calculates the average gradient of the cost function for the mini-batch and updates the parameters in the opposite direction.
- It is the most commonly used method in practice because it combines the ideas of batch gradient descent with SGD.
    - It strikes a balance between the efficiency of batch gradient descent and the robustness of stochastic gradient descent.
- It is computationally efficient and less noisy than stochastic gradient descent while still being able to converge to a good solution.
- Mini-batch sizes typically range from 50 to 256.

Advantages
- The model is updated more frequently than with batch gradient descent, allowing for more robust convergence and helping to avoid local minima.
- Batched updates are computationally more efficient than stochastic gradient descent.
- Batching offers both the efficiency of not holding all the training data in memory and a straightforward implementation of the algorithm.

Disadvantages
- Mini-batch requires an additional hyperparameter, the mini-batch size, to be set for the learning algorithm.
- Error information must be accumulated across each mini-batch of training samples, as in batch gradient descent.
- It can make the implementation more complex.

Configure Mini-Batch Gradient Descent:

- Mini-batch gradient descent is the variant of gradient descent recommended for most applications, especially deep learning.
- Mini-batch sizes, commonly called “batch sizes” for brevity, are often tailored to some aspect of the computing architecture in which the implementation is running. 
        - For example, a power of 2 that matches the memory requirements of the GPU or CPU hardware, such as 32, 64, 128, and 256.
- The batch size is a slider on the learning process.
- Smaller values give a learning process that converges quickly at the expense of noise in the training process. Larger values give a learning process that converges slowly but with more accurate estimates of the error gradient.
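The trade-off between batch size and gradient noise can be illustrated with a small experiment. The following is a minimal sketch on synthetic data, not part of the original notes: the data-generating model, the fixed parameter values `w` and `b`, and the helper names `minibatch_gradient` and `gradient_spread` are all assumptions made for illustration. It estimates the same gradient many times from random mini-batches and reports how much the estimate varies for each batch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic regression data: y = 3*x + 2 + noise
n_samples = 10_000
X = rng.normal(size=(n_samples, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=1.0, size=n_samples)

# Evaluate the gradient at a fixed, deliberately wrong parameter value
w, b = 0.0, 0.0


def minibatch_gradient(batch_size):
    """MSE gradient with respect to w, estimated on one random mini-batch."""
    idx = rng.choice(n_samples, size=batch_size, replace=False)
    residual = y[idx] - (w * X[idx, 0] + b)
    return -2.0 * np.mean(residual * X[idx, 0])


def gradient_spread(batch_size, n_draws=500):
    """Standard deviation of the gradient estimate across many mini-batches."""
    return float(np.std([minibatch_gradient(batch_size) for _ in range(n_draws)]))


spread_by_batch_size = {bs: gradient_spread(bs) for bs in (1, 32, 256)}
for bs, spread in spread_by_batch_size.items():
    print(f"batch_size={bs:>3}  gradient std={spread:.3f}")
```

The spread should shrink as the batch size grows, which is the "less noise, but more computation per update" side of the slider.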

In [None]:
import random

import numpy as np


class MBGDRegressor:
    """Linear regression trained with mini-batch gradient descent."""

    def __init__(self, batch_size, learning_rate=0.01, epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self, X_train, y_train):
        # Initialise the parameters
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        for i in range(self.epochs):
            # One epoch performs n_samples // batch_size parameter updates
            for j in range(X_train.shape[0] // self.batch_size):
                # Draw a random mini-batch of row indices
                idx = random.sample(range(X_train.shape[0]), self.batch_size)

                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_
                residual = y_train[idx] - y_hat

                # Gradients of the mean squared error over the mini-batch
                # (both are averaged so they share the same scale)
                intercept_der = -2 * np.mean(residual)
                coef_der = -2 * np.dot(residual, X_train[idx]) / self.batch_size

                self.intercept_ = self.intercept_ - (self.lr * intercept_der)
                self.coef_ = self.coef_ - (self.lr * coef_der)

        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_

**Step 6: Use the Model for Decision-Making**

To understand which factors significantly influence readmissions, you need a systematic approach grounded in exploratory analysis, statistical rigor, and effective communication.

1. Thinking Approach: Identifying Significant Factors
- Define the Business Objective
    - Objective: Identify key drivers of hospital readmissions (to improve patient care and optimize resource allocation)
    - Questions to Answer:
        - What are the strongest predictors of readmissions?
        - Which predictors can be influenced through policy or operational changes?
        - How much can readmissions be reduced if certain factors are addressed?

- Perform Exploratory Data Analysis (EDA)
    - Inspect Data Distributions: Use histograms and boxplots to understand the spread of variables.
    - Check Relationships:
        - Pairwise correlations for numerical variables (e.g., length_of_stay vs. readmissions).
        - Grouped summaries for categorical variables (e.g., readmissions across age groups).
        - Example Insights:
            - Patients with longer stays might have higher readmission risks.
            - Non-adherence to medication might strongly correlate with readmissions.

- Statistical Hypothesis Testing
    - Use statistical tests to confirm relationships:
        - T-tests for differences in means (e.g., medication adherence between high and low readmission groups).
        - Chi-square tests for independence between categorical variables (e.g., age group vs. readmission rates).

Example 1: Statistical Hypothesis Testing for Medication Adherence
- Objective: Determine if medication adherence significantly differs between patients who are readmitted and those who are not.
- Approach: Two-Sample t-Test
- Hypotheses:
    - $H_0$: The mean adherence rate is the same for both groups (readmitted and not readmitted).
    - $H_a$: The mean adherence rate differs between the groups.

- Steps:
    - Prepare the Data:
        - Split patients into two groups: "Readmitted" and "Not Readmitted."
        - Collect medication adherence rates for each group.
    - Check Assumptions:
        - Normality: Use a Shapiro-Wilk or Kolmogorov-Smirnov test to check if adherence rates are normally distributed.
        - Equal Variance: Use Levene’s test or Bartlett’s test.
    - Perform the t-Test:
        - If variances are equal, use a standard t-test. If not, use Welch’s t-test.
    - Interpret Results:
        - If $p < 0.05$, reject $H_0$ and conclude that adherence rates differ significantly.

In [None]:
from scipy.stats import ttest_ind

# Example data
adherence_readmitted = [0.7, 0.65, 0.6, 0.75, 0.8]  # Adherence rates for readmitted
adherence_not_readmitted = [0.9, 0.85, 0.88, 0.92, 0.89]  # Adherence rates for not readmitted

# Perform t-test
t_stat, p_value = ttest_ind(adherence_readmitted, adherence_not_readmitted, equal_var=False)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
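The assumption checks listed in the steps can be sketched with `scipy.stats` as follows. This is an illustrative sketch that reuses the example adherence values; the 0.05 threshold stored in `alpha` and the rule for choosing between the standard and Welch's t-test are assumptions for the example. With only five observations per group, the normality and variance tests have very little power, so the results should be read as a demonstration of the workflow rather than as evidence.

```python
from scipy.stats import levene, shapiro, ttest_ind

# Example adherence rates (same illustrative values as in the t-test example)
adherence_readmitted = [0.7, 0.65, 0.6, 0.75, 0.8]
adherence_not_readmitted = [0.9, 0.85, 0.88, 0.92, 0.89]

alpha = 0.05  # assumed significance level

# Shapiro-Wilk: H0 is that the sample comes from a normal distribution
for name, sample in [("readmitted", adherence_readmitted),
                     ("not readmitted", adherence_not_readmitted)]:
    stat, p = shapiro(sample)
    print(f"Shapiro-Wilk ({name}): W={stat:.3f}, p={p:.3f}")

# Levene: H0 is that both groups have equal variances
levene_stat, levene_p = levene(adherence_readmitted, adherence_not_readmitted)
equal_var = bool(levene_p >= alpha)
print(f"Levene: W={levene_stat:.3f}, p={levene_p:.3f}, equal_var={equal_var}")

# Standard t-test if the variances look equal, Welch's t-test otherwise
t_stat, p_value = ttest_ind(adherence_readmitted, adherence_not_readmitted,
                            equal_var=equal_var)
print(f"t={t_stat:.3f}, p={p_value:.4f}")
```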

Example 2: Statistical Hypothesis Testing for Age Group vs. Readmission Rates
- Objective: Test if age group (categorical variable) is independent of readmission status.
- Approach: Chi-Square Test of Independence
- Hypotheses:
    - $H_0$: Age group is independent of readmission status.
    - $H_a$: Age group and readmission status are dependent.

- Steps:
    - Create a Contingency Table:
        - Rows: Age groups (e.g., <40, 40–60, >60).
        - Columns: Readmission status (e.g., Yes, No).
    - Perform the Chi-Square Test on the contingency table.
    - Interpret Results:
        - If $p < 0.05$, reject $H_0$ and conclude that age group is associated with readmission status.

In [None]:
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table
table = np.array([[50, 200], [70, 230], [100, 300]])

# Perform Chi-Square Test
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"Chi2 Statistic: {chi2}, P-value: {p_value}")

Example 3: Statistical Hypothesis Testing for Length of Stay (LOS)
- Objective: Compare the average LOS for readmitted vs. not readmitted patients.
- Approach: Two-Sample t-Test
- Hypotheses:
    - $H_0$: The mean LOS is the same for readmitted and non-readmitted patients.
    - $H_a$: The mean LOS differs.
- Steps:
    - Prepare the Data:
        - Split patients into two groups: "Readmitted" and "Not Readmitted."
        - Collect the length of stay for each patient in each group.
    - Check Assumptions:
        - Normality: Use a Shapiro-Wilk or Kolmogorov-Smirnov test to check if lengths of stay are normally distributed.
        - Equal Variance: Use Levene’s test or Bartlett’s test.
    - Perform the t-Test:
        - If variances are equal, use a standard t-test. If not, use Welch’s t-test.
    - Interpret Results:
        - If $p < 0.05$, reject $H_0$ and conclude that the mean LOS differs significantly between the groups.
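A minimal sketch of this test, using made-up length-of-stay values (the numbers and the variable names `los_readmitted` and `los_not_readmitted` are hypothetical and only illustrate the call):

```python
from scipy.stats import ttest_ind

# Hypothetical length-of-stay values in days (illustrative only)
los_readmitted = [8, 10, 7, 12, 9, 11]
los_not_readmitted = [3, 5, 4, 6, 2, 5]

# Welch's t-test (equal_var=False) does not assume equal variances
t_stat, p_value = ttest_ind(los_readmitted, los_not_readmitted, equal_var=False)
print(f"T-statistic: {t_stat:.3f}, P-value: {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: mean LOS differs between the groups.")
else:
    print("Fail to reject H0: no significant difference in mean LOS.")
```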

Example 4: Relationship Between LOS and Readmission Rate
- Approach: ANOVA (Analysis of Variance)
- Objective: Check if LOS groups (<3 days, 3–7 days, >7 days) have significantly different readmission rates.
- Hypotheses:
    - $H_0$: The mean readmission rate is the same across all LOS groups.
    - $H_a$: At least one group differs.
- Steps:
    - Group the Data:
        - Divide LOS into groups.
        - Calculate readmission rates for each group.
    - Perform ANOVA.
    - Interpret Results:
        - If $p < 0.05$, reject $H_0$ and conclude that readmission rates differ across LOS groups.

In [None]:
from scipy.stats import f_oneway

# Example data
readmission_short = [0.1, 0.12, 0.08, 0.15]  # Readmission rates for <3 days
readmission_medium = [0.2, 0.22, 0.25, 0.18]  # Readmission rates for 3–7 days
readmission_long = [0.35, 0.4, 0.38, 0.42]  # Readmission rates for >7 days

# Perform ANOVA
f_stat, p_value = f_oneway(readmission_short, readmission_medium, readmission_long)
print(f"F-statistic: {f_stat}, P-value: {p_value}")



- Build and Interpret a Regression Model
    - Fit a regression model to identify significant predictors.
    - Check p-values of coefficients: Variables with p-values below a chosen threshold (e.g., 0.05) are statistically significant.
    - Evaluate effect size: Large coefficients indicate a strong influence on the target, but coefficients are only comparable across predictors that are on the same scale (e.g., after standardization).
    - Test for interaction effects, such as how length_of_stay and severity jointly influence readmissions.

- Refine the Model
    - Handle multicollinearity: Use the Variance Inflation Factor (VIF) to identify highly correlated predictors, then remove or combine them.
    - Validate the model: Perform cross-validation to ensure robustness.

This will help the institute to:
- Improve medication adherence programs for high-risk patients.
- Extend hospital stays for patients with severe conditions if needed.
- Schedule follow-up visits more effectively to minimize readmission risks.

Example 5: Predicting Readmissions Based on LOS
- Approach: Regression (logistic regression in the example code, because readmission is a binary outcome)
- Objective: Use regression to predict readmissions based on LOS and other predictors.

##### How Regression Helps Solve This Problem
- Quantifies Relationships: Identifies and quantifies the factors contributing to readmissions.
- Predicts Outcomes: Provides actionable predictions to guide healthcare interventions.
- Allocates Resources: Helps prioritize patients who need more attention post-discharge.
- Supports Policy Changes: Enables data-driven policy improvements in patient care.

In [None]:
import statsmodels.api as sm

# Example data
X = [2, 4, 6, 8, 10]  # LOS in days
y = [0, 1, 0, 1, 1]  # Readmission (0 = No, 1 = Yes)

# Add constant for intercept
X = sm.add_constant(X)

# The outcome is binary, so fit a logistic regression
model = sm.Logit(y, X).fit()
print(model.summary())

2. Presenting Findings to Senior Management and Board
- Tailor Communication to the Audience
    - Senior management: Focus on actionable insights, resource implications, and patient care improvements.
    - Board of directors: Emphasize high-level trends, financial impacts, and alignment with strategic goals.

- Structure of Presentation
    - Introduction
        - Start with the context: "Readmission rates are a critical indicator of hospital performance and patient care quality."
        - Summarize the objective: "This study identifies key factors driving readmissions and proposes targeted interventions."

    - Key Findings
        - Use visuals like bar charts, scatter plots, and regression coefficient tables:
            - Example: "Medication adherence has the strongest inverse relationship with readmissions. A 10% increase in adherence is associated with a 5% reduction in readmissions."
        - Highlight statistical significance:
            - "Length of stay and severity are significant at p < 0.05, confirming their importance."
    
    - Implications
        - Show real-world impact: "Addressing non-adherence could prevent ~300 readmissions annually, saving $1.2M in costs."
        - Prioritize recommendations: "Focus on medication adherence programs, especially for older patients with comorbidities."

    - Actionable Recommendations
        - Immediate Steps:
            - Develop a post-discharge follow-up protocol for high-risk groups.
            - Launch an adherence monitoring program.
        - Future Research:
            - Investigate additional factors like social determinants of health.

    - Conclusion
        - Reinforce value: "By addressing these factors, we can improve patient outcomes, meet regulatory benchmarks, and reduce financial strain."

- Tools for Communication
    - Visual Dashboards: Create dashboards showing predicted readmissions, trends over time, and "what-if" scenarios.
    - Executive Summaries: Provide concise summaries with high-impact visuals and key takeaways.
    - Financial Impact Models: Quantify cost savings or ROI of proposed interventions.

3. Example Insights and Visualizations

Insight Example: Medication Adherence
- Insight: "Medication adherence has a strong negative correlation with readmissions ($r = -0.65$). A 10% increase in adherence is associated with a 5% reduction in readmissions."

Visualization:
- A bar chart comparing adherence rates and average readmissions.
- Regression coefficient chart showing the magnitude of influence.

Insight Example: Length of Stay
- Insight: "Patients with hospital stays >7 days are 2x more likely to be readmitted within 30 days."

Visualization:
- Scatter plot: length_of_stay vs. readmissions.
- Box plot: Readmission rates by length-of-stay categories.

4. Implementation Plan
Once the board approves, focus on operationalizing findings:

- Deploy targeted interventions for high-risk patients.
- Set KPIs to monitor the effectiveness of changes.
- Continuously refine the model based on new data.

##### Set KPIs to monitor the effectiveness of changes

**KPI 1: 30-Day Readmission Rate**
- Definition: Percentage of patients readmitted to the hospital within 30 days of discharge.
- Why Important: This is the primary metric to assess whether interventions are reducing readmissions.
- Formula: $\text{Readmission Rate} = \frac{\text{Number of patients readmitted within 30 days}}{\text{Total number of discharged patients}} \times 100$
- Target: A reduction in the readmission rate over time indicates success.

**KPI 2: Medication Adherence Rate**
- Definition: Percentage of patients adhering to their prescribed medications post-discharge.
- Why Important: Non-adherence is a leading cause of readmissions. Monitoring this ensures interventions like counseling and follow-ups are effective.
- Formula: $\text{Medication Adherence Rate} = \frac{\text{Number of patients adhering to medications}}{\text{Total number of patients}} \times 100$
- Target: An increase in adherence correlates with better outcomes and fewer readmissions.

**KPI 3: Follow-Up Appointment Compliance**
- Definition: Percentage of discharged patients attending follow-up appointments within the recommended time frame.
- Why Important: Follow-up visits can identify issues early and prevent readmissions.
- Formula: $\text{Compliance Rate} = \frac{\text{Number of attended follow-ups}}{\text{Number of scheduled follow-ups}} \times 100$
- Target: High compliance indicates improved patient engagement.

**KPI 4: Average Length of Stay (LOS)**
- Definition: Average number of days patients spend in the hospital.
- Why Important: Shorter stays can indicate efficiency but might increase readmissions if patients are discharged prematurely.
- Formula: $\text{LOS} = \frac{\text{Total inpatient days}}{\text{Number of discharges}}$
- Target: Maintain an optimal LOS that balances cost and readmission prevention.

**KPI 5: Percentage of High-Risk Patients Identified**
- Definition: Proportion of discharged patients flagged as high-risk for readmission and targeted for interventions.
- Why Important: Monitoring ensures that predictive models and risk stratification tools are working effectively.
- Formula: $\text{High-Risk Patients Identified} = \frac{\text{Number of flagged high-risk patients}}{\text{Total number of discharged patients}} \times 100$
- Target: Increase the identification rate while reducing actual readmissions.
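The five KPIs can be computed directly from discharge records. The sketch below follows the KPI definitions above (each percentage is the count of interest divided by the relevant total). The `Discharge` record layout, its field names, and the five sample records are hypothetical and exist only to make the arithmetic concrete.

```python
from dataclasses import dataclass


@dataclass
class Discharge:
    """One discharged patient (hypothetical record layout)."""
    los_days: int
    readmitted_30d: bool
    adherent: bool
    followup_scheduled: bool
    followup_attended: bool
    flagged_high_risk: bool


# Made-up records for illustration only
discharges = [
    Discharge(3, False, True, True, True, False),
    Discharge(9, True, False, True, False, True),
    Discharge(5, False, True, True, True, False),
    Discharge(12, True, False, True, True, True),
    Discharge(2, False, True, False, False, False),
]


def pct(numerator, denominator):
    """Percentage helper that returns 0.0 for an empty denominator."""
    return 100.0 * numerator / denominator if denominator else 0.0


total = len(discharges)
scheduled = sum(d.followup_scheduled for d in discharges)
attended = sum(d.followup_attended for d in discharges if d.followup_scheduled)

kpis = {
    "readmission_rate_pct": pct(sum(d.readmitted_30d for d in discharges), total),
    "adherence_rate_pct": pct(sum(d.adherent for d in discharges), total),
    "followup_compliance_pct": pct(attended, scheduled),
    "average_los_days": sum(d.los_days for d in discharges) / total,
    "high_risk_identified_pct": pct(sum(d.flagged_high_risk for d in discharges), total),
}

for name, value in kpis.items():
    print(f"{name}: {value:.1f}")
```

Tracking these values per month or per quarter would give the trend lines used in the stakeholder dashboards.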

##### Presenting KPIs to Stakeholders

**Visual Presentation**

Use dashboards and visualizations:
- Bar charts to compare readmission rates before and after interventions.
- Line graphs showing trends over time for medication adherence and follow-up compliance.
- Heatmaps for condition-specific readmission trends.

Narrative
- Highlight success: "We reduced the 30-day readmission rate from 18% to 12%, saving $500,000 annually."
- Focus on actionable insights: "Medication adherence programs have been effective, with a 15% increase in adherence leading to a 5% drop in readmissions."

Recommendations
- Continue monitoring these KPIs for sustained improvements.
- Scale successful interventions to other patient groups or hospitals.

## 2. Multiple Linear Regression:

The simple linear regression equation is as follows:

$$Y = \beta_{0} + \beta_{1}X_1$$

where:
- $\beta_{0}$ is the intercept, interpreted as the value of $Y$ when $X_1 = 0$;
- $\beta_{1}$ is the coefficient, interpreted as the effect on $Y$ for a one unit increase in $X_1$; and
- $X_1$ is the single predictor variable.

Extending that idea to multiple linear regression is as simple as adding an $X_{j}$ and corresponding $\beta_{j}$ for each of the $p$ predictor variables, where $j \in \{1, \dots, p\}$.
   
Hence in multiple linear regression, our regression equation becomes:   

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p$$

where:

- $Y$ is the response variable which depends on the $p$ predictor variables;
- $\beta_0$ is the intercept, interpreted as the value of $Y$ when _all_ predictor variables are equal to zero;
- $\beta_j$ is the average effect on $Y$ of a one unit increase in $X_j$, assuming all other predictors are held fixed.

Multiple linear regression is a technique to understand the relationship between a single dependent variable and multiple independent variables. Including the error term $\epsilon$, the model is:

$$ Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \epsilon $$

What it means:
- It is used when two or more independent variables influence the dependent variable.

- A linear equation defines the relationship, with the coefficients of the independent variables representing the effect of each variable on the dependent variable.
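To make the equation concrete, the following sketch generates data from a known two-predictor model and recovers the coefficients by ordinary least squares with NumPy. The coefficient values in `true_beta`, the noise level, and the sample size are assumptions chosen for illustration; the notes themselves fit models with `statsmodels`, which reports the same estimates along with standard errors and p-values.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data generated from Y = 1.5 + 2.0*X1 - 0.5*X2 + noise
n = 200
X = rng.normal(size=(n, 2))
true_beta = np.array([1.5, 2.0, -0.5])  # [beta_0, beta_1, beta_2]
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so that beta_0 is estimated as the intercept
X_design = np.column_stack([np.ones(n), X])

# Ordinary least squares: minimise ||y - X_design @ beta||^2
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("Estimated coefficients:", np.round(beta_hat, 3))
```

Each estimated $\beta_j$ is read as the average change in $Y$ for a one-unit increase in $X_j$ with the other predictor held fixed.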

# Assumptions of Multiple Linear Regression

Regression is a parametric approach, which means that it makes assumptions about the data.

For successful regression analysis, it’s essential to validate these assumptions and to keep the following considerations in mind.

- Overfitting: When more and more variables are added to a model, the model may become far too complex and usually ends up memorizing the data points in the training set.
    - This phenomenon is known as overfitting.
    - It usually leads to high training accuracy and very low test accuracy.
- Understanding of linearity and multicollinearity (predictors).
    - Multicollinearity is the phenomenon where some of the independent variables in a model are interrelated.
- Understanding of independence, homoscedasticity, and normality (residuals).
- Feature Selection: With more variables present, selecting the optimal set of predictors from the pool of given features (many of which might be redundant) becomes an important task for building a relevant and better model.

We'll be moving through the following sections in order to achieve our objectives:

- Investigating our predictor variables:
    - Checking for linearity;
    - Checking for multicollinearity;
- Fitting a model with `statsmodels.OLS`;
- Evaluating our fitted model:
    - Checking for independence;
    - Checking for homoscedasticity;
    - Checking for normality;
    - Checking for outliers.

# Checking for Linearity

Linearity is a key assumption in multilinear regression. It states that the relationship between each predictor and the response variable should be linear. When this assumption is violated, the model's predictions may be biased or less effective.

The first thing we need to check is whether the relationship between each predictor variable and the response variable is linear.
- A linear relationship means that a change in the response *Y* due to a one-unit change in the predictor $X_j$ is constant, regardless of the value of $X_j$.

If we fit a linear regression model to data whose underlying relationship is non-linear, it will fail to adequately capture that relationship, resulting in a mathematically inappropriate model.

#### Non-Linearity
Issue:
- Linear regression assumes a linear relationship between the independent variables ($X$) and the dependent variable ($Y$).
- If the true relationship is nonlinear, the model may provide poor predictions and incorrect interpretations.

Detection:
- Residual Plots: If the residuals (errors) show a curved or non-random pattern, non-linearity is likely present.
- Scatterplots: Plot $X$ vs. $Y$ to check for a linear relationship.
- Polynomial or Log Transformations: Fit higher-order models and compare performance.

Solutions:
- Apply transformations (e.g., log, square root, Box-Cox) to make the relationship more linear.
- Use polynomial regression (e.g., adding $X^2$ terms).
- Consider using non-linear models such as decision trees or splines.

### Detecting Non-Linearity

To check for linearity, we can produce scatter plots of each individual predictor against the response variable. The intuition here is that we are looking for obvious linear relationships.

**Result**

- State which variables appear to have an approximately linear relationship with the response variable.
- State which variables exhibit no linear relationship with the response variable.

In [None]:
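import matplotlib.pyplot as plt

# Assumes `df` holds the response `mpg` plus up to 10 predictor columns
predictors = df.columns.drop('mpg')

fig, axs = plt.subplots(2, 5, figsize=(14, 6))
fig.subplots_adjust(hspace=0.5, wspace=0.2)
axs = axs.ravel()

# One scatter plot per predictor against the response
for index, column in enumerate(predictors):
    axs[index].set_title("{} vs. mpg".format(column), fontsize=16)
    axs[index].scatter(x=df[column], y=df['mpg'], color='blue', edgecolor='k')

fig.tight_layout(pad=1)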

Step 1: Diagnosing Non-Linearity

Visual Inspection
- Use scatter plots to visualize the relationship between predictors and the response variable.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plots for each predictor vs. response
for predictor in ["SquareFootage", "Bedrooms", "DistanceFromCityCenter"]:
    sns.scatterplot(x=df[predictor], y=df["HousePrice"])
    plt.title(f"{predictor} vs. HousePrice")
    plt.xlabel(predictor)
    plt.ylabel("HousePrice")
    plt.show()


In [None]:
# Pairplot to visualize relationships
sns.pairplot(df, x_vars=["SquareFootage", "Bedrooms", "DistanceFromCityCenter"], y_vars="HousePrice", kind="reg")
plt.show()

Residual Plots
- Residual plots help check for linearity by plotting residuals against predicted values.

In [None]:
# Predicted values and residuals
predicted = model.predict(X)
residuals = Y - predicted

# Residual plot
plt.scatter(predicted, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.title("Residual Plot")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()


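If you see a pattern such as a curve, it indicates non-linearity; a spread that increases with the predicted values points to heteroscedasticity instead.

If the relationship between variables is non-linear, applying a transformation such as a log transformation can help, as described in the following steps.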

Step 2: Transforming Predictors or the Response Variable

When to Transform
- Predictors: Transform when individual predictors have non-linear relationships with the response.
- Response Variable: Transform when the response itself shows a skewed distribution or non-linear relationship with predictors.

|Transformation |Formula|Use Case|
|---------------|-------|--------|
|Log            |$\log(x)$|Skewed data, multiplicative relationships, exponential growth.|
|Square Root    |$\sqrt{x}$|Reduces spread while preserving the order of values.|
|Polynomial     |$x^2, x^3, \dots$|For non-linear relationships that resemble curves.|
|Reciprocal     |$\frac{1}{x}$|When values decrease rapidly as the predictor increases.|
|Box-Cox        |$\frac{y^{\lambda} - 1}{\lambda}$ ($\log y$ when $\lambda = 0$)|Data-driven power transformation for normalizing data or reducing variance.|

Step 3: Applying Transformations in Python

Example 1: Log Transformation
- Suppose SquareFootage has a non-linear relationship with HousePrice.

In [None]:
# Log transformation of SquareFootage
df["Log_SquareFootage"] = np.log(df["SquareFootage"])

# Fit the model again
X_trans = df[["Log_SquareFootage", "Bedrooms", "DistanceFromCityCenter"]]
X_trans = sm.add_constant(X_trans)
model_log = sm.OLS(Y, X_trans).fit()

print(model_log.summary())

Example 2: Polynomial Transformation
- Suppose DistanceFromCityCenter has a curved relationship with HousePrice.

In [None]:
# Add polynomial terms
df["Distance_Squared"] = df["DistanceFromCityCenter"] ** 2

# Fit model with polynomial term
X_poly = df[["SquareFootage", "Bedrooms", "DistanceFromCityCenter", "Distance_Squared"]]
X_poly = sm.add_constant(X_poly)
model_poly = sm.OLS(Y, X_poly).fit()

print(model_poly.summary())


Example 3: Box-Cox Transformation for Response Variable
- Normalize HousePrice if it's highly skewed.

In [None]:
from scipy.stats import boxcox

# Box-Cox transformation
Y_boxcox, lambda_boxcox = boxcox(Y)
print(f"Optimal lambda for Box-Cox: {lambda_boxcox}")

# Fit model with transformed response
model_boxcox = sm.OLS(Y_boxcox, X).fit()
print(model_boxcox.summary())


Step 4: Comparing Models
Use metrics like Adjusted $R^2$, AIC, and BIC to compare the effectiveness of models before and after transformations.

In [None]:
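# Compare models (lower AIC is better)
print("Original Model AIC:", model.aic)
print("Log-Transformed Model AIC:", model_log.aic)
print("Polynomial Model AIC:", model_poly.aic)
# Note: the Box-Cox model is fitted to a transformed response, so its AIC is on a
# different scale and is not directly comparable with the models above.
print("Box-Cox Model AIC:", model_boxcox.aic)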

Step 5: Visualizing and Validating Improvements
- Visualizing Residuals After Transformation

In [None]:
# Residual plot after transformation
predicted_trans = model_log.predict(X_trans)
residuals_trans = Y - predicted_trans

plt.scatter(predicted_trans, residuals_trans)
plt.axhline(0, color="red", linestyle="--")
plt.title("Residual Plot After Transformation")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()


Checking $R^2$ and Adjusted $R^2$: compare values before and after applying transformations.

In [None]:
print(f"Original Model R^2: {model.rsquared}, Adjusted R^2: {model.rsquared_adj}")
print(f"Transformed Model R^2: {model_log.rsquared}, Adjusted R^2: {model_log.rsquared_adj}")

# Checking for Multicollinearity

Multicollinearity refers to strong correlation among two or more of the predictor variables in a regression model. It is detrimental to model quality for two reasons:

- It inflates the standard errors of the coefficient estimates;

- It makes it difficult to estimate the effect of any one predictor variable on the response variable, which can lead to incorrect conclusions about that variable's effect.

Multicollinearity should be detected and dealt with deliberately: arbitrarily removing one of the correlated variables from the model can cause the remaining coefficient values to swing wildly and even change sign.

We will check for multicollinearity by generating 
- pairwise scatter plots among predictors
- a correlation heatmap.

#### Collinearity (Multicollinearity)
Issue:
- Multicollinearity occurs when two or more independent variables are highly correlated. 
- This makes it difficult to isolate their individual effects, leading to unstable coefficients and inflated standard errors.

Detection:
- Variance Inflation Factor (VIF): VIF > 10 suggests high multicollinearity.
- Correlation Matrix: High pairwise correlations (> 0.8) indicate potential collinearity.
- Condition Index: A high condition number (> 30) suggests multicollinearity.

Solutions:
- Remove redundant predictors.
- Use Principal Component Analysis (PCA), Ridge Regression, or Lasso Regression to handle collinearity.
- Combine correlated variables into a single composite variable (e.g., sum or average).
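The condition-number check from the detection list can be sketched with NumPy. This is an illustrative example on synthetic predictors: the variables `x1`, `x2`, `x3` and the helper `condition_number` are hypothetical, and scaling each column to unit length before computing the condition number is one common convention rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical predictors: x2 is almost a copy of x1, x3 is unrelated
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
x3 = rng.normal(size=n)


def condition_number(*columns):
    """Condition number of the design matrix after scaling columns to unit length."""
    X = np.column_stack(columns)
    X = X / np.linalg.norm(X, axis=0)
    return float(np.linalg.cond(X))


cond_independent = condition_number(x1, x3)
cond_collinear = condition_number(x1, x2, x3)
print(f"x1, x3:     {cond_independent:.1f}")
print(f"x1, x2, x3: {cond_collinear:.1f}")
```

Adding the near-duplicate predictor should push the condition number well past the threshold of 30 quoted above.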


Multicollinearity can be detected using the following methods.

- Pairwise Correlations: Checking the pairwise correlations between independent variables can give useful insight into multicollinearity.
    - Pairwise correlations are not always sufficient: a single variable may not fully explain another variable, while a combination of several variables does. To check for this kind of relationship, one can use the VIF.
- Variance Inflation Factor (VIF): VIF measures how well one independent variable is explained by all the other independent variables.
    - VIF is given by,

$ VIF_i = \frac{1}{1 - R_i^2}$

where 
- $i$ refers to the $i$-th variable, which is regressed on the rest of the independent variables, and $R_i^2$ is the $R^2$ of that regression.

Heuristics
- If VIF > 10, multicollinearity is high and the variable is a candidate for removal.
- If VIF is between 5 and 10, the variable may be acceptable but should be inspected first.
- If VIF < 5, the value is considered good.
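The VIF definition can be checked by hand: regress each predictor on the others, take the $R^2$ of that regression, and apply $1 / (1 - R^2)$. The sketch below does this with NumPy on synthetic data. The predictors and the helper name `vif` are hypothetical; in practice `statsmodels` provides `variance_inflation_factor`, and this example is only meant to show what that number represents.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical predictors: x2 is strongly related to x1, x3 is independent
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.2, size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])


def vif(X, i):
    """VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing column i on the others."""
    target = X[:, i]
    others = np.delete(X, i, axis=1)
    design = np.column_stack([np.ones(len(target)), others])  # include an intercept
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ beta
    r_squared = 1.0 - residuals.var() / target.var()
    return 1.0 / (1.0 - r_squared)


vifs = [vif(X, i) for i in range(X.shape[1])]
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

The two related predictors should show a high VIF, while the independent one stays close to 1.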

**Step 1: Detecting Multicollinearity**

(a) Pairwise scatter plots

As can be inferred from the name, a pairwise scatter plot produces a visual $n \times n$ matrix, where $n$ is the total number of variables compared, in which each cell represents the relationship between two variables. The diagonal cells of this visual represent the comparison of a variable with itself, and as such are substituted by a representation of the distribution of values taken by that variable.


In [None]:
# Due to the number of visuals created, this codeblock takes about one minute to run.
from seaborn import pairplot
g = pairplot(df1.drop('mpg', axis='columns'))
g.fig.set_size_inches(9,9)

(b) Correlation Matrix
- Use a correlation matrix to identify highly correlated predictors.

Correlation heatmap

Another way we can visually discover linear relationships between pairs of variables within our dataset is through the use of a correlation heatmap. Similar to the pairwise scatter plot we produced above, this visual presents a matrix in which each row and each column represents a variable, and each cell shows the correlation between the corresponding pair of variables.

Result Interpretation
- Look for correlations > 0.8 or < -0.8, which may indicate multicollinearity.

In [None]:
# We only compare the predictor variables, and thus drop the target `mpg` column.
corr = df1.drop('mpg', axis='columns').corr()

from statsmodels.graphics.correlation import plot_corr

fig=plot_corr(corr,xnames=corr.columns)

In [None]:
import pandas as pd

# Compute correlation matrix
correlation_matrix = df[["SquareFootage", "Bedrooms", "DistanceFromCityCenter"]].corr()

# Display the heatmap
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()


(c) Variance Inflation Factor (VIF)
- VIF quantifies how much the variance of a regression coefficient is inflated due to multicollinearity.

Result Interpretation

Rule of Thumb:
- $VIF=1$: No multicollinearity.
- $1<VIF<5$: Low multicollinearity.
- $VIF>5$: High multicollinearity.
- $VIF>10$: Severe multicollinearity.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each predictor
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Prepare data for VIF calculation
X_vif = df[["SquareFootage", "Bedrooms", "DistanceFromCityCenter"]]
X_vif = sm.add_constant(X_vif)

# Calculate VIF for each predictor
vif_data = pd.DataFrame()
vif_data["Feature"] = X_vif.columns
vif_data["VIF"] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]

print(vif_data)

**Step 2: Mitigating Multicollinearity**

1. Drop Highly Correlated Predictors:
- If two predictors are highly correlated, remove one to reduce redundancy.

In [None]:
X_reduced = X.drop(columns=["Bedrooms"])  # Example

In [None]:
# Drop 'Bedrooms' if it has high multicollinearity
X_reduced = df[["SquareFootage", "DistanceFromCityCenter"]]
X_reduced = sm.add_constant(X_reduced)

# Fit the model with reduced predictors
model_reduced = sm.OLS(Y, X_reduced).fit()
print(model_reduced.summary())

2. Apply Ridge or Lasso Regression:

- Ridge regression penalizes large coefficients to handle multicollinearity.
- Lasso regression performs feature selection by shrinking some coefficients to zero.

In [None]:
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0)  # Regularization strength
ridge.fit(X, Y)
print("Ridge coefficients:", ridge.coef_)

lasso = Lasso(alpha=0.1)
lasso.fit(X, Y)
print("Lasso coefficients:", lasso.coef_)

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Split data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Ridge regression
ridge = Ridge(alpha=1.0)  # Adjust alpha (regularization strength) as needed
ridge.fit(X_train, Y_train)

# Evaluate Ridge model
ridge_predictions = ridge.predict(X_test)
ridge_mse = mean_squared_error(Y_test, ridge_predictions)
print("Ridge Regression MSE:", ridge_mse)
print("Ridge Coefficients:", ridge.coef_)

In [None]:
from sklearn.linear_model import Lasso

# Lasso regression
lasso = Lasso(alpha=0.1)  # Adjust alpha as needed
lasso.fit(X_train, Y_train)

# Evaluate Lasso model
lasso_predictions = lasso.predict(X_test)
lasso_mse = mean_squared_error(Y_test, lasso_predictions)
print("Lasso Regression MSE:", lasso_mse)
print("Lasso Coefficients:", lasso.coef_)


3. Principal Component Analysis (PCA):
- PCA reduces dimensions by transforming correlated predictors into uncorrelated components.

Interpreting PCA:
- Principal components represent uncorrelated combinations of the original predictors.
- The explained variance ratio tells you how much variance is captured by each component.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # Reduce dimensions
X_pca = pca.fit_transform(X.iloc[:, 1:])

In [None]:
from sklearn.decomposition import PCA
import numpy as np

# Scale predictors for PCA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X.iloc[:, 1:])  # Exclude constant term

# Apply PCA
pca = PCA(n_components=2)  # Choose number of components
X_pca = pca.fit_transform(X_scaled)

# Fit model with PCA components
model_pca = sm.OLS(Y, sm.add_constant(X_pca)).fit()
print(model_pca.summary())

**Step 3: Comparing Models**

Evaluate model performance before and after applying mitigation techniques using metrics such as:

- Mean Squared Error (MSE)
- Adjusted $𝑅^2$ 
- Akaike Information Criterion (AIC)

In [None]:
# Compare AIC (statsmodels fits) and test MSE (scikit-learn fits)
print("Original Model AIC:", model.aic)
print("Reduced Model AIC:", model_reduced.aic)
print("Ridge Model MSE:", ridge_mse)
print("Lasso Model MSE:", lasso_mse)

# Outliers in Data points
Issue:
- Outliers are extreme data points that can disproportionately affect regression estimates, leading to biased coefficients and distorted predictions.

Detection:
- Boxplots and scatterplots: Identify extreme values.
- Standardized residuals: If residuals exceed ±3 standard deviations, they may be outliers.
- Cook’s Distance: Values above roughly 0.5 to 1 (or above $4/n$ as a stricter screen) indicate influential points.

Solutions:
- Investigate the cause: Check for data entry errors.
- Transform variables (log transformation can reduce outlier influence).
- Use robust regression techniques (e.g., RANSAC, Huber regression).
- Remove extreme outliers only if they are genuine data errors.

### Impact of Outliers on Different Machine Learning Models
Outliers can significantly impact certain models while others are more robust. Here's a breakdown of models that are least and most affected by outliers:

##### **Most Affected Models (Sensitive to Outliers)**
These models assume normally distributed data or use distance-based calculations, making them highly sensitive to extreme values.

**Linear Regression**
- Uses least squares optimization, which squares the residuals and therefore gives outliers a disproportionate weight in the fit.
- Outliers can skew the regression line, leading to biased coefficients.
- Solution: Use Robust Regression or Log Transform data.

**Logistic Regression**
- While more resistant than linear regression, extreme values in features can affect decision boundaries.
- Solution: Use Regularization (L1/L2 penalty) or Winsorization.

**K-Nearest Neighbors (KNN)**
- Distance-based, so outliers can pull the nearest neighbors incorrectly.
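- Solution: Use scaled data (StandardScaler, or RobustScaler when outliers are present) or a Minkowski distance with a small $p$ (e.g., Manhattan distance, $p=1$), which is less sensitive to extreme coordinates.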

**Support Vector Machines (SVM)**
- Tries to maximize the margin between classes, but outliers can change the margin dramatically.
- Solution: Use soft-margin SVM (C parameter tuning).

**Principal Component Analysis (PCA)**
- Based on variance maximization, meaning outliers distort the principal components.
- Solution: Use Robust PCA or detect outliers before applying PCA.

**Neural Networks (Deep Learning Models)**
- Can handle outliers if trained with enough data, but in small datasets, outliers cause unstable learning.
- Solution: Use batch normalization or dropout layers.

##### **Least Affected Models (Robust to Outliers)**
These models ignore outliers naturally or are designed to handle them effectively.

**Tree-Based Models (Decision Trees, Random Forest, XGBoost, LightGBM)**
- Trees split data based on conditions, making them naturally robust to outliers.
- Extreme values do not impact splits significantly.
- Solution: No need for strict outlier handling.

**Robust Regression (Huber Regression, Theil-Sen Estimator, RANSAC Regression)**
- Modifies loss function to reduce the effect of large errors (outliers).
- Best choice for handling outliers in regression.
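
To make the contrast with ordinary least squares concrete, the sketch below fits `LinearRegression` and `HuberRegressor` to the same synthetic data, in which five responses have been corrupted. The data and the size of the corruption are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

# Synthetic data (assumption): y = 2x + 1 plus small noise
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 2.0 * X_demo.ravel() + 1.0 + rng.normal(0, 0.5, 100)

# Corrupt the responses of the five largest x values
top = np.argsort(X_demo.ravel())[-5:]
y_demo[top] += 60.0

ols = LinearRegression().fit(X_demo, y_demo)
# Huber loss is quadratic for small residuals and linear for large ones
huber = HuberRegressor(max_iter=1000).fit(X_demo, y_demo)

print("True slope: 2.0")
print("OLS slope:", ols.coef_[0])
print("Huber slope:", huber.coef_[0])
```

The Huber slope should stay much closer to the true value because large residuals contribute only linearly to its loss.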

**Gradient Boosting Models (XGBoost, CatBoost, LightGBM)**
- Uses tree-based structures, so outliers do not heavily impact performance.
- Solution: Still benefits from log transformation if outliers exist.

**Outlier-Resistant Clustering (DBSCAN, K-Medoids) as Alternatives to K-Means**
- Traditional K-Means is affected by outliers, but DBSCAN and K-Medoids are robust.
- Solution: Use DBSCAN instead of K-Means.
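
The difference can be sketched on synthetic 2D data (an assumption): K-Means has to assign every point to a cluster, while DBSCAN marks low-density points as noise with the label `-1`. The `eps` and `min_samples` values are illustrative and would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Two tight synthetic clusters plus two far-away points (assumption)
rng = np.random.default_rng(3)
cluster_a = rng.normal(loc=[0, 0], scale=0.3, size=(50, 2))
cluster_b = rng.normal(loc=[5, 5], scale=0.3, size=(50, 2))
outliers = np.array([[20.0, 20.0], [-15.0, 10.0]])
points = np.vstack([cluster_a, cluster_b, outliers])

# K-Means assigns every point, including the outliers, to one of the clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# DBSCAN labels points without enough close neighbours as noise (-1)
dbscan = DBSCAN(eps=1.0, min_samples=5).fit(points)
noise_idx = np.where(dbscan.labels_ == -1)[0]

print("K-Means labels of the two outliers:", kmeans.labels_[-2:])
print("DBSCAN noise indices:", noise_idx)
```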

Key Takeaways
- Use Tree-Based Models (Random Forest, XGBoost, LightGBM) to avoid outlier issues.
- For regression, use Robust Regression (Huber, Theil-Sen, RANSAC).
- Use outlier-resistant clustering like DBSCAN instead of K-Means.
- Preprocess data (log transform, Winsorization, trimming) for models sensitive to outliers.

#### Outlier Detection & Handling in Python
This script covers:
- ✅ Identifying outliers using IQR, Z-score, and Isolation Forest
- ✅ Handling outliers with Winsorization, transformation, and robust regression
- ✅ Applying outlier-resistant models

🔹 Step 1: Generate Sample Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Generate data with outliers
np.random.seed(42)
data = np.random.normal(loc=50, scale=15, size=100).tolist()  # Normal data
data += [150, 170, 200]  # Add extreme outliers
df = pd.DataFrame({'Value': data})

# Visualize outliers
plt.figure(figsize=(8, 5))
sns.boxplot(x=df['Value'])
plt.title('Boxplot Showing Outliers')
plt.show()


🔹 Step 2: Detect Outliers

1️⃣ IQR Method (Boxplot Approach)

2️⃣ Z-Score Method

3️⃣ Isolation Forest (Machine Learning Approach)

In [None]:
# 1

Q1 = df['Value'].quantile(0.25)
Q3 = df['Value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

df['Outlier_IQR'] = (df['Value'] < lower_bound) | (df['Value'] > upper_bound)
print(df[df['Outlier_IQR']])

# 2

df['Z_Score'] = np.abs(stats.zscore(df['Value']))
df['Outlier_Z'] = df['Z_Score'] > 3
print(df[df['Outlier_Z']])

# 3

from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(contamination=0.05, random_state=42)
df['Outlier_IF'] = iso_forest.fit_predict(df[['Value']])
df['Outlier_IF'] = df['Outlier_IF'] == -1  # Convert -1 (outliers) to True
print(df[df['Outlier_IF']])

🔹 Step 3: Handling Outliers

1️⃣ Winsorization (Capping Outliers)

2️⃣ Log Transformation (For Right-Skewed Data)

3️⃣ Robust Regression (RANSAC for Outliers)

In [None]:
# 1 
from scipy.stats.mstats import winsorize

df['Value_Winsorized'] = winsorize(df['Value'], limits=[0.05, 0.05])  # Cap top/bottom 5%
sns.boxplot(x=df['Value_Winsorized'])
plt.title('Winsorized Data')
plt.show()

# 2 
df['Value_Log'] = np.log(df['Value'].clip(lower=1))  # Avoid log(0)
sns.histplot(df['Value_Log'], kde=True)
plt.title('Log Transformed Data')
plt.show()

# 3
from sklearn.linear_model import RANSACRegressor

X = np.array(range(len(df))).reshape(-1, 1)  # Dummy feature
y = df['Value']

ransac = RANSACRegressor()
ransac.fit(X, y)
y_pred = ransac.predict(X)

plt.scatter(X, y, label="Original Data")
plt.plot(X, y_pred, color='red', label="Robust Regression")
plt.legend()
plt.show()


🔹 Step 4: Using Outlier-Resistant Models

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# Create dummy features
df['Feature'] = np.random.normal(loc=10, scale=5, size=len(df))

X = df[['Feature']]
y = df['Value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Use Random Forest (Robust to Outliers)
model = RandomForestRegressor()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("Random Forest Model Trained Successfully")


# High-Leverage Points
High Leverage Points and the Leverage Statistic

### What Are High Leverage Points?
- High leverage points are observations with extreme predictor values (X-values) compared to the rest of the data.
- They have a strong influence on the fitted regression model because they affect the estimated regression coefficients.
- A high leverage point may or may not be an outlier in terms of Y (dependent variable), but it is extreme in X.

Why Are High Leverage Points Important?
- They can distort regression results, leading to biased coefficients.
- If a high leverage point is also an outlier in Y, it has an outsized impact on model predictions.
- Identifying and addressing high leverage points improves model robustness.

Issue:
- High-leverage points are observations with extreme values for predictor variables (𝑋). 
- They can unduly influence the regression model, even if they are not outliers in 𝑌.

### Detecting High Leverage Points Using the Leverage Statistic
The leverage of an observation $i$ is measured by the hat value $h_i$, the $i$-th diagonal element of the hat matrix $H$, which helps detect high leverage points:
$$ H = X(X^T X)^{-1} X^T$$

Each observation has a leverage score:
$$ h_i = X_i(X^T X)^{-1} X^T_i$$
- Where:
    - $X_i$ is the row vector of predictor values for observation i.
    - $h_i$ measures how far $X_i$ is from the mean of all predictor values.
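
The hat matrix formula can be checked numerically. The sketch below builds a design matrix with an intercept column on synthetic data (an assumption), computes $H$ directly, and reads the leverages off its diagonal. A useful sanity check is that the leverages sum to the number of columns in the design matrix, $k+1$.

```python
import numpy as np

# Synthetic predictors (assumption) with one extreme observation at index 0
rng = np.random.default_rng(4)
n, k = 50, 2
predictors = rng.normal(size=(n, k))
predictors[0] = [8.0, -8.0]

# Design matrix: intercept column plus the k predictors
X_design = np.column_stack([np.ones(n), predictors])

# H = X (X^T X)^{-1} X^T; the leverages h_i are its diagonal entries
H = X_design @ np.linalg.inv(X_design.T @ X_design) @ X_design.T
h = np.diag(H)

print("Sum of leverages:", h.sum())          # equals k + 1
print("Index of largest leverage:", h.argmax())
print("Threshold 2(k+1)/n:", 2 * (k + 1) / n)
```

Forming $H$ explicitly is fine for small $n$; for larger datasets, `hat_matrix_diag` from `statsmodels` avoids building the full $n \times n$ matrix.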

Threshold for High Leverage

High leverage points have predictor values that deviate significantly from the rest of the data.
A common rule: $h_i > \frac{2(k+1)}{n}$
- Where:
    - k = number of predictors,
    - n = number of observations.

High leverage points should be carefully examined, not automatically removed.

Detection:
- Leverage statistic (hat values): High-leverage points typically have a hat value $h_i > \frac{2(k+1)}{n}$,
    - where
        - k is the number of predictors and
        - n is the sample size.
- Cook’s Distance and DFFITS: Identify points with excessive influence.

Implementing High Leverage Point Detection in Python
- Fit a linear regression model.
- Compute leverage scores.
- Identify high leverage points.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Generate synthetic data
np.random.seed(42)
X = np.random.normal(10, 2, 100)  # Normally distributed predictor
Y = 3 * X + np.random.normal(0, 3, 100)  # Linear relationship with noise

# Introduce high leverage points
X[95:] = X[95:] + 20  # Shift last 5 values far from the rest

# Fit regression model
X_const = sm.add_constant(X)  # Add intercept
model = sm.OLS(Y, X_const).fit()

# Compute leverage scores (hat values)
influence = model.get_influence()
leverage = influence.hat_matrix_diag  # Extract leverage values

# Define threshold for high leverage
k = X_const.shape[1] - 1  # Number of predictors
n = len(X)
threshold = 2 * (k + 1) / n

# Identify high leverage points
high_leverage_points = np.where(leverage > threshold)[0]

# Plot leverage scores
plt.figure(figsize=(10, 5))
plt.scatter(range(n), leverage, label="Leverage Values", alpha=0.7)
plt.axhline(y=threshold, color='red', linestyle='dashed', label="Threshold")
plt.scatter(high_leverage_points, leverage[high_leverage_points], color='red', label="High Leverage Points", s=100)
plt.xlabel("Observation Index")
plt.ylabel("Leverage")
plt.title("Leverage Values and High Leverage Points")
plt.legend()
plt.show()

# Print high leverage points
print("High Leverage Points:", high_leverage_points)
print("Leverage Values of High Leverage Points:", leverage[high_leverage_points])


### How to Handle High Leverage Points?
Solutions:
- Assess model stability by running regression with and without the high-leverage points.

✅ If they are valid observations:
- Consider using robust regression (e.g., Huber regression).
    - Robust regression methods downweight influential points.
- Verify if the observation provides valuable insight.

❌ If they are errors or extreme influential points:
- Investigate the cause.
- Consider removing or transforming the data.
    - Add interaction terms or transformations if high-leverage points reveal missing relationships.

# Overfitting and Underfitting in Linear Regression

### Analyzing the Relationship Between Model Error and Model Complexity
The relationship between model complexity and model error follows a well-known pattern often illustrated by the bias-variance tradeoff. As model complexity increases, error initially decreases but eventually rises again due to overfitting.

##### **Types of Model Error**
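A model that performs well on training data but poorly on test data is overfitting; one that performs poorly on both is underfitting. Both failures can be traced to the components of model error.

Model error consists of three components: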

**Bias** (Underfitting)

Bias measures how far a model’s average prediction is from the true value, which limits how accurate its predictions can be on future unseen data.
- Bias is error that comes from the model’s assumptions, not from noise in the training data.
    - Represents error due to overly simplistic assumptions.
        - Complex models, given enough training data, can make accurate predictions.
        - Models that are too naive are very likely to predict poorly.
    - High-bias models do not learn enough patterns from the training data.
- Linear algorithms have high bias, which makes them fast to learn and easy to understand but, in general, less flexible.
    - This implies lower predictive performance on complex problems whose structure a linear model cannot capture.

Bias

📌 Definition: Bias refers to the error introduced by over-simplifying a machine learning model. A model with high bias makes strong assumptions about the data, leading to underfitting (poor performance on both training and test data).

🔹 Characteristics of High-Bias Models:
- Oversimplifies relationships in the data.
- Misses key patterns, leading to high training and test error.
- Performs poorly even on training data.

🔹 Example:
- Using linear regression to model a nonlinear relationship leads to high bias.
- A decision tree with depth = 1 (stump) cannot capture complex patterns.

**Variance** (Overfitting)

Variance is the sensitivity of the model to the training data.
- It quantifies how much the model changes when the training data changes.
    - Represents sensitivity to small fluctuations in the training data.
    - A model should not change too much from one training dataset to the next.
        - Stability across datasets means that the algorithm is picking out the underlying patterns between the inputs and the output variables rather than noise.
    - A model with low variance does not change drastically when the training data changes (it is generalizable).
        - A model with high variance changes drastically even after a small change in the training dataset.
    - High-variance models learn noise along with actual patterns.

Variance

📌 Definition: Variance refers to the error introduced by a model’s sensitivity to small variations in the training data. A model with high variance is too complex, capturing noise along with the actual patterns, leading to overfitting (low training error but high test error).

🔹 Characteristics of High-Variance Models:
- Memorizes training data, leading to low training error.
- Performs poorly on test data due to lack of generalization.
- Sensitive to small changes in the dataset.

🔹 Example:
- A deep decision tree (very high depth) memorizes training data but fails on test data.
- Polynomial regression of very high degree overfits noise in the data.

**Irreducible Error**

This is noise inherent in the data that no model can eliminate.
- Example: Measurement errors, random fluctuations in real-world processes.
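
The three components can be estimated by simulation. The sketch below is a rough illustration under stated assumptions: the true function is $\sin(x)$, the noise standard deviation is 0.3, and the helper `bias_variance` is a hypothetical name. It refits polynomial models of different degrees on many freshly drawn training sets and measures squared bias and variance of the predictions at fixed evaluation points.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def true_f(x):
    # Assumed ground-truth function for the simulation
    return np.sin(x)

rng = np.random.default_rng(6)
x_grid = np.linspace(0.5, 5.5, 25).reshape(-1, 1)  # fixed evaluation points
noise_sd = 0.3                                     # assumed irreducible noise level

def bias_variance(degree, n_repeats=200, n_train=30):
    """Estimate squared bias and variance of a polynomial fit of given degree."""
    preds = np.empty((n_repeats, len(x_grid)))
    for r in range(n_repeats):
        # Draw a fresh training set each time
        x_tr = rng.uniform(0, 6, size=(n_train, 1))
        y_tr = true_f(x_tr).ravel() + rng.normal(0, noise_sd, n_train)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds[r] = model.fit(x_tr, y_tr).predict(x_grid)
    # Bias^2: gap between the average prediction and the truth
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_grid).ravel()) ** 2)
    # Variance: spread of predictions across training sets
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 4, 10)}
for d, (b2, var) in results.items():
    print(f"degree={d:2d}  bias^2={b2:.4f}  variance={var:.4f}  noise={noise_sd**2:.4f}")
```

The noise term `noise_sd**2` is the same for every model, which is what makes it irreducible.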



In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

### Python Example: High Bias in Linear Regression

# Generate a non-linear dataset
np.random.seed(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = X**2 + np.random.normal(0, 1, X.shape)

# Fit a linear regression model (high bias)
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Plot results
plt.scatter(X, y, label="True Data")
plt.plot(X, y_pred, color="red", label="Linear Model (High Bias)")
plt.legend()
plt.show()

### Python Example: High Variance in Overfitting

# Fit a high-degree polynomial model (high variance)
poly = PolynomialFeatures(degree=10)
X_poly = poly.fit_transform(X)
high_var_model = LinearRegression()
high_var_model.fit(X_poly, y)
y_pred_high_var = high_var_model.predict(X_poly)

# Plot results
plt.scatter(X, y, label="True Data")
plt.plot(X, y_pred_high_var, color="red", label="High Variance Model")
plt.legend()
plt.show()

##### **Model Complexity vs. Error Curve (Bias Variance Tradeoff)**
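Test error plotted against model complexity follows a U-shaped curve: it falls as bias decreases, then rises again as variance starts to dominate.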

A supervised machine learning algorithm seeks to strike a balance between low bias and low variance for increased robustness.

Bias and variance typically move in opposite directions as model complexity changes.
- Making a model simpler usually increases bias and reduces variance.
- Making a model more complex usually reduces bias and increases variance.

Finding an equilibrium between bias and variance is crucial, and algorithms must navigate this trade-off for optimal outcomes.

**Overfitting**

Overfitting occurs when a model learns every pattern and the noise in the training data to such an extent that its performance on unseen data suffers.
- The model fits the data so well that it interprets noise as patterns in the data.

It arises when a model has **low bias and high variance**, so it ends up memorizing the data.

Overfitting causes the model to become specific rather than generic.
- Model captures noise in training data. 

This usually leads to: 
- high training accuracy and very low test accuracy.
    - Training error is low, but test error increases.

There are several ways to prevent overfitting:
- Cross-validation
- If the training data is too small, add more relevant and clean data.
- If the data has too many features, do some feature selection and remove unnecessary features.
- Regularization

**Underfitting**

Underfitting occurs when the model fails to learn from the training dataset and therefore also fails to generalize to the test dataset.
- Detected by poor performance metrics on both training and test data.

When a model has **high bias and low variance**, it cannot capture the structure of the data, which causes underfitting.
- Model is too simple and fails to capture patterns.
- It is unable to find the hidden underlying patterns in the data. 
- This usually leads to low training accuracy and low test accuracy.
    - Training and test errors are both high.

Ways to prevent underfitting:
- Increase the model complexity
- Increase the number of features in the training data
- Remove noise from the data.

Ways to Prevent Overfitting in Machine Learning

Overfitting happens when a model learns not just the underlying patterns in the data but also the noise. This reduces its ability to generalize to new, unseen data. Below are effective techniques to prevent overfitting and improve model generalization.

1. Train with More Data (If Possible)

📌 Why? More data helps the model learn general patterns rather than noise.

🔹 Example: In fraud detection, adding more transaction data makes it less likely that the model treats a handful of rare cases as general patterns.

- How to check?
    - If test error is much higher than training error, your model might be overfitting.
    - Use learning curves (plot error vs. training size) to check if more data is needed.
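
A learning curve can be sketched with scikit-learn's `learning_curve`. The dataset here is synthetic and the unpruned decision tree is chosen deliberately because it tends to overfit; both are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption)
X_demo, y_demo = make_regression(n_samples=600, n_features=5, noise=15, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    DecisionTreeRegressor(random_state=0),  # unpruned tree: prone to overfitting
    X_demo,
    y_demo,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE, so flip the sign and average over the folds
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

for n, tr, va in zip(sizes, train_mse, val_mse):
    print(f"n_train={n:4d}  train MSE={tr:10.2f}  validation MSE={va:10.2f}")
```

A persistent gap between training and validation error signals overfitting; if the validation error is still falling at the largest training size, more data is likely to help.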

2. Cross-Validation (K-Fold CV)

📌 Why? Ensures the model performs well on different subsets of the data.

🔹 How? Instead of a single train-test split, K-Fold CV splits data into K groups and trains the model K times, each time using a different fold for validation.
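
The mechanics of K-Fold can be written out as an explicit loop. This is a minimal sketch on synthetic data (an assumption), using `KFold` to generate the index splits.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic regression data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_mse = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X_demo), start=1):
    # Train on K-1 folds, validate on the held-out fold
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    mse = mean_squared_error(y_demo[val_idx], model.predict(X_demo[val_idx]))
    fold_mse.append(mse)
    print(f"Fold {fold}: validation MSE = {mse:.2f}")

print("Mean validation MSE:", np.mean(fold_mse))
```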

3. Regularization (L1, L2, Dropout)

📌 Why? Regularization adds a penalty to overly complex models, reducing overfitting.

- L1 (Lasso) and L2 (Ridge) Regularization for Regression
    - L1 (Lasso): Shrinks some coefficients to zero, performing feature selection.
    - L2 (Ridge): Reduces the magnitude of all coefficients, preventing extreme values.
    - Elastic Net: Combines L1 and L2 regularization.
- Dropout Regularization for Deep Learning
    - 📌 Why? Prevents a neural network from relying too much on certain neurons by randomly "dropping" them during training.
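
Elastic Net can be sketched with scikit-learn's `ElasticNet`, where `alpha` sets the overall penalty strength and `l1_ratio` sets the mix between L1 and L2. The dataset and the parameter values below are assumptions for illustration, not tuned choices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

# Synthetic data (assumption): 10 features, only 3 of them informative
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5, random_state=0
)

# l1_ratio=0.5 weights the L1 and L2 penalties equally
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_demo, y_demo)
n_zero = int((enet.coef_ == 0).sum())

print("Elastic Net coefficients:", enet.coef_)
print("Coefficients shrunk exactly to zero:", n_zero)
```

Setting `l1_ratio=1.0` recovers the Lasso penalty, while values close to 0 behave more like Ridge.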

4. Feature Selection (Remove Irrelevant Features)

📌 Why? Irrelevant or highly correlated features increase model complexity unnecessarily.

🔹 How? Use techniques like correlation analysis, Lasso regression, or mutual information to select only the most important features.
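
As one possible illustration of the mutual information approach, the sketch below keeps the three highest-scoring features with `SelectKBest`. The data is synthetic and the choice `k=3` is an assumption; in practice `k` would be chosen by cross-validation.

```python
from functools import partial

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic data (assumption): 8 features, only 3 of them informative
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5, random_state=0
)

# Fix random_state so the mutual information estimates are reproducible
score_func = partial(mutual_info_regression, random_state=0)
selector = SelectKBest(score_func=score_func, k=3).fit(X_demo, y_demo)
X_selected = selector.transform(X_demo)

print("Mutual information scores:", selector.scores_)
print("Selected feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_selected.shape)
```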

5. Early Stopping (For Deep Learning)

📌 Why? Stops training when the validation error starts increasing, preventing overfitting.

🔹 How? Monitor the validation loss and stop training when it stops improving.

6. Data Augmentation (For Image Data)

📌 Why? Increases dataset diversity, helping models generalize better.

🔹 Example: In image classification, augmenting images by rotating, flipping, or changing brightness.

7. Ensemble Methods (Bagging & Boosting)

📌 Why? Combining multiple models reduces the impact of overfitting by balancing high variance and high bias.

Bagging (Bootstrap Aggregating)
- Uses random sampling to train multiple models and average their predictions.
- Example: Random Forest (aggregates multiple decision trees).

Boosting
- Trains models sequentially, where each model corrects errors of the previous one.
- Example: Gradient Boosting, XGBoost, LightGBM, AdaBoost.

**Optimal Complexity:**
- Balanced bias and variance.
- Model generalizes well to unseen data.
- Test error is minimized.

The ideal model must balance bias and variance to minimize total error.

| Model Complexity | Bias | Variance | Generalization Error |
|---------------|--------------------|----------------------------------| ---------------------|
| Too Simple (e.g., Linear Regression for Nonlinear Data) | High | Low | High (Underfitting) |
| Too Complex (e.g., High-Depth Decision Tree) | Low | High | High (Overfitting) |
| Optimal Model (e.g., Regularized Model, Ensemble Learning) | Moderate | Moderate | Low |

Key Observations:
- Increasing model complexity reduces bias but increases variance.
- Simplifying a model reduces variance but increases bias.
- The sweet spot balances both to achieve low generalization error.

###  Strategies to Balance Bias and Variance

| Strategy | Reduces Bias? | Reduces Variance? | Best for |
|---------------|--------------------|----------------------------------| ---------------------|
| Regularization (L1, L2, Dropout) | ❌ | ✅ | Overfitting (High Variance) |
| Ensemble Methods (Bagging & Boosting) | ✅ (mainly Boosting) | ✅ (mainly Bagging) | Overfitting (Bagging), Underfitting (Boosting) |
| Feature Selection (Remove Irrelevant Features) | ❌ | ✅ | Overfitting (High Variance) |
| Training Data Increased | ❌ | ✅ | Generalization Improvement |
| Cross-Validation (K-Fold CV) | ❌ | ❌ | Model Evaluation (estimates error; does not change the model itself) |
| More Complex Model (More Features, Higher Degree) | ✅ | ❌ | Underfitting (High Bias) |

##### Bias-Variance Tradeoff in Polynomial Regression

This example demonstrates how increasing model complexity affects error.

Expected Pattern from the Code (the exact degrees and error values depend on the data and on how it is split)
- Low-degree polynomial models (e.g., degree=1) → High training and test errors (underfitting).
- Medium-degree models → Lowest test error (optimal complexity).
- High-degree models (e.g., degree=10) → Very low training error, with test error that rises again once the model starts fitting noise (overfitting).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Generate synthetic dataset
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(scale=0.3, size=X.shape[0])

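# Split into training and test sets at random (80/20) so that the test points
# cover the same X range as the training points; a sequential split would
# only measure extrapolation beyond the training data.
# Indices are sorted so the fitted curves still plot left to right.
rng = np.random.RandomState(0)
idx = rng.permutation(len(X))
train_idx, test_idx = np.sort(idx[:80]), np.sort(idx[80:])
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]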

# Train models with different complexity levels
degrees = [1, 3, 5, 10]  # Increasing complexity
train_errors, test_errors = [], []

plt.figure(figsize=(10, 6))

for d in degrees:
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)

    # Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Calculate errors
    train_error = mean_squared_error(y_train, y_train_pred)
    test_error = mean_squared_error(y_test, y_test_pred)

    train_errors.append(train_error)
    test_errors.append(test_error)

    # Plot model fits
    plt.scatter(X_test, y_test, color="black", label="True data" if d == 1 else "")
    plt.plot(X_test, y_test_pred, label=f"Degree {d}")

plt.xlabel("X")
plt.ylabel("y")
plt.title("Effect of Model Complexity on Fit")
plt.legend()
plt.show()

# Plot error vs complexity
plt.figure(figsize=(8, 5))
plt.plot(degrees, train_errors, label="Training Error", marker="o")
plt.plot(degrees, test_errors, label="Test Error", marker="o")
plt.xlabel("Model Complexity (Polynomial Degree)")
plt.ylabel("Mean Squared Error")
plt.title("Bias-Variance Tradeoff")
plt.legend()
plt.show()


##### How to Control Model Complexity?
To find the right balance:
- Use cross-validation to estimate test error.
- Apply regularization techniques (Lasso, Ridge, Dropout for deep learning).
- Compare simpler vs. more complex models to select the best fit.

##### Techniques to Prevent Overfitting

| Technique | How It Helps   | Best Use Case           |
|---------------|--------------------|----------------------------------|  
|More Data	|Reduces variance by learning general patterns |	When data is limited|
|Cross-Validation	| Ensures model is robust to unseen data | Model selection. |
|Regularization	|Penalizes complexity (Lasso, Ridge, Dropout). |	Regression & Deep Learning.|
|Feature Selection	|Removes irrelevant or redundant features |	High-dimensional data|
|Early Stopping	| Stops training before overfitting occurs | Deep learning. |
|Data Augmentation	|Creates synthetic data to increase diversity |	Image processing|
|Ensemble Methods|Combines multiple models to reduce variance |Tree-based models|


Python Example: K-Fold Cross-Validation

🔹 Result: A more reliable performance estimate compared to a single train-test split.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Generate sample dataset
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=42)

# Train model with 5-Fold Cross-Validation
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='r2')

print("Cross-validation scores:", scores)
print("Mean CV Score:", scores.mean())


Python Example: Ridge and Lasso Regression

🔹 Effect: Helps reduce model complexity while maintaining good performance.

In [None]:
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train models with regularization
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)

print("Ridge Coefficients:", ridge.coef_)
print("Lasso Coefficients:", lasso.coef_)  # Some coefficients will be zero


Dropout Regularization for Deep Learning

🔹 Effect: Reduces overfitting by forcing the model to learn redundant representations.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

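# Define a simple neural network with dropout
# (input_shape=(10,) assumes 10 input features; adjust it to match your data)
model = Sequential([
    Dense(128, activation='relu', input_shape=(10,)),
    Dropout(0.5),  # Randomly zeroes 50% of the previous layer's outputs during training
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid')
])

# Compile so the model can be trained (binary classification, matching the sigmoid output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])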


Python Example: Feature Importance with Random Forest

🔹 Effect: Helps in eliminating less useful features, improving generalization.

In [None]:
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

# Train Random Forest model
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Get feature importances
feature_importances = pd.Series(model.feature_importances_, index=["Feature1", "Feature2", "Feature3", "Feature4", "Feature5"])
print(feature_importances.sort_values(ascending=False))


Python Example: Early Stopping in Neural Networks

🔹 Effect: Saves computational time and improves generalization.

In [None]:
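from tensorflow.keras.callbacks import EarlyStopping

# Stop when validation loss has not improved for 5 epochs and restore the best weights
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Assumes `model` is a compiled Keras model whose input shape matches X_train
# (the name `model` was reassigned to a RandomForestRegressor in the previous cell)
model.fit(X_train, y_train, epochs=100, validation_data=(X_test, y_test), callbacks=[early_stopping])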

Python Example: Image Augmentation with Keras

🔹 Effect: Reduces overfitting by exposing the model to different variations of the same data.

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True
)

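# Augment training data (assumes `training_images` is a 4D array of shape
# (samples, height, width, channels) and `training_labels` holds the matching labels)
augmented_images = datagen.flow(training_images, training_labels, batch_size=32)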


Python Example: Random Forest vs. Gradient Boosting

🔹 Effect: Reduces variance and improves stability.

In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rf = RandomForestRegressor(n_estimators=100).fit(X_train, y_train)
gb = GradientBoostingRegressor(n_estimators=100).fit(X_train, y_train)

print("Random Forest Test Score:", rf.score(X_test, y_test))
print("Gradient Boosting Test Score:", gb.score(X_test, y_test))


### Fitting the model using `statsmodels.OLS`

`sklearn` is limited in terms of metrics and tools available to evaluate the appropriateness of the regression models we fit.
- As a means to expand our analysis, we import the `statsmodels` library, which has a rich set of statistical tools to help us.

##### Generating the regression string

Those of you familiar with the R language will know that fitting a regression model requires a formula string of the form:

`y ~ X`

which is read as follows: "Regress y on X". The `statsmodels` library works in a similar way, so we need to generate an appropriate string to feed to the method when we wish to fit the model.
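
A minimal sketch of building and using such a formula string, on a synthetic DataFrame with hypothetical column names `y`, `x1`, and `x2`. The alias `smf` is used here only to keep it distinct from `statsmodels.api`; the cells below import the formula API under a different alias.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (assumption): y = 1.5 + 2*x1 - 0.5*x2 + small noise
rng = np.random.default_rng(7)
demo = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
demo["y"] = 1.5 + 2.0 * demo["x1"] - 0.5 * demo["x2"] + rng.normal(scale=0.1, size=100)

# Build "response ~ predictor1 + predictor2" from the column names
formula = "y ~ " + " + ".join(["x1", "x2"])
fit = smf.ols(formula=formula, data=demo).fit()

print(formula)
print(fit.params)
```

The formula interface adds the intercept automatically, so no `add_constant` call is needed.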

In [None]:
import statsmodels.formula.api as sm

In [None]:
df.describe().T

In [None]:
# Regress target variable on all of the predictors.
formula_str = df.columns[0]+' ~ '+'+'.join(df.columns[1:]); formula_str

In [None]:
# Importing libraries for visualizations
import seaborn as sns
import matplotlib.pyplot as plt

# Plot each predictor against Sales in a single row of scatterplots
# (`height` replaces the older `size` argument of pairplot)
sns.pairplot(df, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales', height=4, kind='scatter')
plt.show()

##### Plotting 3D plot for multiple Linear regression

To get a better idea of what a multi-dimensional dataset looks like, we'll generate a 3D scatter plot showing `mpg` on the _z_-axis (height), with two predictor variables, `cyl` and `wt`, on the _x_- and _y_-axes.

In [None]:
# create figure and 3d axes
fig = plt.figure(figsize=(8,7))
ax = fig.add_subplot(111, projection='3d')

# set axis labels
ax.set_zlabel('MPG')
ax.set_xlabel('No. of Cylinders')
ax.set_ylabel('Weight (1000 lbs)')

# scatter plot with response variable and 2 predictors
ax.scatter(df['cyl'], df['wt'], df['mpg'])

We know that in simple linear regression (2D), any model that we fit to data manifests in the form of a straight line. Extending this idea to 3D, the line becomes a plane - a flat surface which is chosen to minimise the squared vertical distances between each observation (red dots), and the plane, as shown in the figure below from ISLR.

<img src="https://github.com/Explore-AI/Public-Data/raw/master/3D%20regression%20ISLR.jpg" alt="plane" style="width: 450px"/>

The result of a multivariate linear regression in higher dimensionality is known as a _hyperplane_ - similar to the flat surface in the figure above, but in a _p_-dimensional space, where $p>3$. Unfortunately, humans lack the ability to visualise any number of dimensions greater than three - so we have to be content with the idea that a hyperplane in _p_-dimensional space is effectively like a flat surface in 3-dimensional space.

In [None]:
# Plot a heatmap of the pairwise correlations
sns.heatmap(df.corr(), cmap='YlGnBu', annot=True)
plt.show()

### Fitting the Multivariate Regression Model

In `sklearn`, fitting a multiple linear regression model is much the same as fitting a simple linear regression. This time, of course, our $X$ contains multiple columns, where it only contained one before. 

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, train_size = 0.7, test_size = 0.3, random_state = 100 )

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

### Construct and fit the model

We now go ahead and fit our model.
- Use the `OLS` (Ordinary Least Squares) regression model from the `statsmodels` library.

In [None]:
import statsmodels.api as sm

In [None]:
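# Add a constant to get an intercept
X_train_sm = sm.add_constant(X_train)
# Fit the regression line using 'OLS'
lr = sm.OLS(y_train, X_train_sm).fit()

# OR

# The formula interface (lowercase `ols`) lives in statsmodels.formula.api
import statsmodels.formula.api as smf
# `df` is the DataFrame whose columns were used to build formula_str
model = smf.ols(formula=formula_str, data=df)
fitted = model.fit()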

In [None]:
# Print the parameters, i.e. intercept and slopes of the fitted regression
lr.params

# extract model intercept (add_constant names the intercept column 'const'
# when X_train is a DataFrame)
beta_0 = float(lr.params['const'])

# extract model coeffs
beta_js = lr.params.drop('const').to_frame(name='Coefficient')
beta_js

### Interpreting Coefficients of Multilinear Regression

In a multilinear regression model, the coefficients represent the relationship between 
- each predictor (independent variable) and 
- the response (dependent variable), 
while controlling for the effects of other predictors in the model.

Intercept ($𝛽_0$ | `beta_0`):
- This is the predicted value of the response variable when all predictors are set to zero.
- It is meaningful only if all predictors can realistically take a value of zero.

Slope Coefficients ($𝛽_𝑖$ | `beta_js`):
- Each $𝛽_𝑖$ measures the change in the response variable for a one-unit increase in predictor $𝑋_𝑖$, assuming all other predictors remain constant.
- A positive $𝛽_𝑖$: Indicates that an increase in $𝑋_𝑖$ is associated with an increase in the response.
- A negative $𝛽_𝑖$: Indicates that an increase in $𝑋_𝑖$ is associated with a decrease in the response.

P-Values:
- A p-value tests the null hypothesis that the coefficient $𝛽_𝑖$ is zero (no effect). 
    - If the p-value is small (typically <0.05), the predictor is considered statistically significant in explaining the response variable.

Standardized Coefficients:
- If predictors are measured in different units, their coefficients can't be directly compared. Standardized coefficients (beta weights) are used to determine the relative importance of predictors.

##### Explaining Multilinear Regression equation

$$ 𝑦=𝛽_{0}+𝛽_{1}𝑥_{1}+𝛽_{2}𝑥_{2}+ 𝜖 $$

- $𝛽_{1}$: if $𝛽_{1}$ = 2, then a one-unit increase in $𝑥_{1}$ is associated with an average increase of 2 units in 𝑦, holding $𝑥_{2}$ constant.
- $𝛽_{2}$: if $𝛽_{2}$ = -3, then a one-unit increase in $𝑥_{2}$ is associated with an average decrease of 3 units in 𝑦, holding $𝑥_{1}$ constant.
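
The interpretation above can be checked numerically. The sketch below uses simulated data (the true coefficients 1, 2 and -3, the sample size, and the noise level are assumptions chosen for illustration) and recovers the coefficients with NumPy's least-squares solver.

```python
import numpy as np

# Simulated data (hypothetical): y = 1 + 2*x1 - 3*x2 + noise
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares estimate of (beta_0, beta_1, beta_2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Estimated coefficients:", np.round(beta_hat, 2))

# Predicted change in y for a one-unit increase in x1, holding x2 fixed at 0
base = np.array([1.0, 0.0, 0.0]) @ beta_hat
shifted = np.array([1.0, 1.0, 0.0]) @ beta_hat
print("Change in prediction:", round(shifted - base, 2))
```

The change in the prediction equals the estimated $𝛽_{1}$, which is what "holding $𝑥_{2}$ constant" means in practice.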

### Testing Relationships Between Response and Predictors

Multilinear regression tests the relationship between the response variable (𝑦) and the predictors ($𝑥_{1}$, $𝑥_{2}$, …, $𝑥_{p}$) by modeling 𝑦 as a linear combination of the predictors:

$$ 𝑦=𝛽_{0}+𝛽_{1}𝑥_{1}+𝛽_{2}𝑥_{2}+…+𝛽_{p}𝑥_{p}+ 𝜖 $$

1. Hypothesis Testing:
- For each predictor $𝑥_{j}$, the Null Hypothesis ($𝐻_0: 𝛽_j = 0$) states that the predictor has no effect on the response variable (𝑦), i.e. there is no relationship between the predictor ($𝑥_{j}$) and the response (𝑦).
- The Alternative Hypothesis ($𝐻_a: 𝛽_j ≠ 0$) states that the predictor has an effect, i.e. there is a relationship.

2. **t-statistic: test is performed for each coefficient**

How to Calculate the t-statistic in Linear regression

The t-statistic in linear regression measures how many standard errors the estimated coefficient is away from zero. 
- It is used for hypothesis testing to determine if a predictor variable is statistically significant.

The formula to calculate the t-statistic for a coefficient is:

$$t = \frac{\hat{𝛽}_j}{SE_{\hat{𝛽}_j}}$$

Where:
- $\hat{𝛽}_j$: Estimated coefficient (e.g., slope or intercept).
- $SE_{\hat{𝛽}_j}$: Standard error of the estimated coefficient $\hat{𝛽}_j$.
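
As a minimal sketch of the formula above, the t-statistic can be computed by hand from the OLS estimates and their standard errors. The data below are simulated (the coefficients and sample size are assumptions for illustration), and the standard errors use the usual OLS expression $\hat{\sigma}^2 (X^TX)^{-1}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept + 2 predictors
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)      # second predictor has no true effect

# OLS estimates: beta_hat = (X'X)^{-1} X'y
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residual variance estimate: RSS / (n - p - 1)
residuals = y - X @ beta_hat
df_error = n - p - 1
sigma2_hat = residuals @ residuals / df_error

# Standard errors: square roots of the diagonal of sigma^2 (X'X)^{-1}
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))

# t-statistics and two-sided p-values from the t-distribution
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df_error)
for name, b, s, t, pv in zip(["const", "x1", "x2"], beta_hat, se, t_stats, p_values):
    print(f"{name}: coef={b:.3f}, SE={s:.3f}, t={t:.2f}, p={pv:.4f}")
```

These are the same quantities that `statsmodels` reports in the `coef`, `std err`, `t` and `P>|t|` columns of its summary table.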

### The t-statistic may be a misleading indicator of variable importance

In multiple linear regression, the t-statistic evaluates the significance of individual predictor variables by testing the null hypothesis that a predictor's coefficient is zero ($𝐻_0: 𝛽_j = 0$).

It can be misleading as an indicator of variable importance in multilinear regression for the following reasons:

- Multicollinearity
    - When predictor variables are highly correlated, the variance of the coefficient estimates increases.
    - This can lead to inflated standard errors and reduced t-statistics, causing variables to appear insignificant even if they are important.
        - Conversely, some variables might have significant t-statistics due to correlation with other predictors rather than their actual contribution to the response variable.

- Dependency on Units of Measurement
    - The size of a coefficient depends on the units of its predictor, whereas the t-statistic does not change when a predictor is rescaled.
        - For example, a predictor with a larger numerical range gets a smaller raw coefficient for the same effect, so raw coefficients cannot be compared across variables without standardization, and a large t-statistic says nothing about the size of the effect per unit.

- Context of the Model
    - The importance of a variable depends on the context of other predictors in the model. 
        - Adding or removing predictors can change the coefficients and t-statistics, leading to different conclusions about importance.

- Does Not Reflect Contribution to $R^2$
    - The t-statistic evaluates the statistical significance of a single variable, but it does not measure its contribution to the model's overall explanatory power ($R^2$).
    - A variable may be statistically significant (high t-statistic) yet contribute little to the variance explained.

- Focuses on Statistical Significance Over Practical Significance
    - A high t-statistic indicates statistical significance but does not imply that the variable is practically meaningful or contributes substantially to predictions.
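
The multicollinearity point above can be illustrated with a small simulation. This is a sketch under assumed settings (simulated predictors, true coefficients 2 and 1, and a second predictor that is either independent of the first or almost a copy of it); `ols_standard_errors` is a hypothetical helper written for this example.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Return OLS coefficient estimates and their standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    residuals = y - X @ beta_hat
    sigma2_hat = residuals @ residuals / (n - k)
    return beta_hat, np.sqrt(sigma2_hat * np.diag(XtX_inv))

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
noise = rng.normal(size=n)

# Case 1: x2 is independent of x1
x2_indep = rng.normal(size=n)
# Case 2: x2 is almost a copy of x1 (correlation above 0.99)
x2_collinear = x1 + 0.1 * rng.normal(size=n)

results = {}
for label, x2 in [("independent", x2_indep), ("collinear", x2_collinear)]:
    X = np.column_stack([np.ones(n), x1, x2])
    y = 1.0 + 2.0 * x1 + 1.0 * x2 + noise
    beta_hat, se = ols_standard_errors(X, y)
    results[label] = (beta_hat, se)
    print(f"{label}: SE(beta_1)={se[1]:.3f}, t(beta_1)={beta_hat[1] / se[1]:.2f}")
```

The true effect of `x1` is the same in both cases, yet its standard error is much larger and its t-statistic much smaller when `x2` is nearly collinear with it.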

Best Practices to Assess Variable Importance
- Use metrics like standardized coefficients to account for differences in units.
- Evaluate variable importance metrics, such as partial $R^2$ , Shapley values, or permutation importance, especially in models with multicollinearity.
- Perform model comparison using adjusted $R^2$ or the Akaike Information Criterion (AIC) to assess the model’s explanatory power with and without specific variables.

##### Implementation of best practices for assessing variable importance in multilinear regression:

- Standardized Coefficients: Calculates coefficients on a standardized scale for comparison.
- Partial $R^2$: Measures the contribution of each variable to the overall $R^2$.
- Permutation Importance: Evaluates the change in model performance when a variable's values are randomly shuffled.

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance

# Example data
np.random.seed(42)
X = pd.DataFrame({
    'Variable_A': np.random.rand(100) * 100,
    'Variable_B': np.random.rand(100) * 50,
    'Variable_C': np.random.rand(100) * 10
})
y = 2 * X['Variable_A'] + 0.5 * X['Variable_B'] + 0.1 * X['Variable_C'] + np.random.randn(100) * 5

# Step 1: Fit a linear regression model
model = LinearRegression()
model.fit(X, y)

# Step 2: Assess importance using standardized coefficients
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model_scaled = LinearRegression()
model_scaled.fit(X_scaled, y)
standardized_coefficients = model_scaled.coef_

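# Step 3: Compute partial R-squared for each variable
def partial_r2(X, y, variable):
    # RSS of the full model (all predictors)
    full_model = LinearRegression().fit(X, y)
    rss_full = np.sum((y - full_model.predict(X)) ** 2)
    # RSS of the reduced model (all predictors except `variable`)
    X_reduced = X.drop(columns=[variable])
    reduced_model = LinearRegression().fit(X_reduced, y)
    rss_reduced = np.sum((y - reduced_model.predict(X_reduced)) ** 2)
    # Share of the reduced model's unexplained variation that `variable` accounts for
    return (rss_reduced - rss_full) / rss_reduced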

partial_r2_values = {var: partial_r2(X, y, var) for var in X.columns}

# Step 4: Compute permutation importance
perm_importance = permutation_importance(model, X, y, n_repeats=30, random_state=42)

# Step 5: Display results
print("Standardized Coefficients:")
for var, coef in zip(X.columns, standardized_coefficients):
    print(f"{var}: {coef:.4f}")

print("\nPartial R-squared Values:")
for var, r2 in partial_r2_values.items():
    print(f"{var}: {r2:.4f}")

print("\nPermutation Importance:")
for var, importance in zip(X.columns, perm_importance.importances_mean):
    print(f"{var}: {importance:.4f}")


2. **F-Test for Overall Model Significance**:

The F-statistic is used in hypothesis testing to evaluate the overall significance of a multiple linear regression model. Specifically, it tests whether at least one of the predictor variables in the model significantly explains variation in the dependent variable.
- Tests the null hypothesis that all coefficients are zero ($𝛽_{1} = 𝛽_{2} = ... = 𝛽_{p} = 0$).
    - If the F-statistic is significant, at least one predictor has a relationship with 𝑌.

The formula for the F-statistic is:

$$ F= \frac{\text{Explained Mean Square (MSR)}}{\text{Residual Mean Square (MSE)}}$$
$$ F= \frac{\frac{TSS-RSS}{p}}{\frac{RSS}{n-p-1}}$$

Where:
- TSS: Total Sum of Squares
- RSS: Residual Sum of Squares
- n: Number of observations
- p: Number of predictors (excluding the intercept)
- Mean Square Regression (MSR): $\frac{TSS-RSS}{p}$
- Mean Square Error (MSE): $\frac{RSS}{n-p-1}$

Steps:

1. Calculate the degrees of freedom:
- For regression: 𝑝
- For error: 𝑛−𝑝−1

2. Compute the explained variance: TSS−RSS

3. Calculate Mean Square Regression (MSR) and Mean Square Error (MSE)

4. Compute the F-statistic:
$$F = \frac{\text{Explained Mean Square (MSR)}}{\text{Residual Mean Square (MSE)}}$$

When to Perform the F-Test?
- Perform the F-test whenever you have a regression model and want to evaluate its overall significance. 
- It is especially relevant in multiple linear regression with several predictors.

Why Perform the F-Test?
- To determine if the model as a whole is useful for predicting the dependent variable.
- It helps decide whether further analysis (e.g., testing individual predictors or refining the model) is warranted.

##### Practical Steps in Hypothesis Testing:

i.  Formulate the Hypotheses
- Null Hypothesis ($𝐻_0$): All regression coefficients (except the intercept) are equal to zero, i.e., the predictors do not explain the variability in the dependent variable.
$$ 𝐻_0: 𝛽_{1} = 𝛽_{2} = ... = 𝛽_{p} = 0$$
- Alternative Hypothesis($𝐻_a$): At least one of the regression coefficients is not zero, i.e., at least one predictor contributes to explaining the variability.
$$ 𝐻_a: \text{at least one } 𝛽_{j}\neq 0, \text{ for } j = 1,2,...,p$$

ii. Calculate the F-Statistic

$$ F= \frac{\text{Explained Mean Square (MSR)}}{\text{Residual Mean Square (MSE)}}$$
$$ F= \frac{\frac{TSS-RSS}{p}}{\frac{RSS}{n-p-1}}$$

where:

$$MSR = \frac{\text{Explained Variance}}{\text{Degrees of Freedom for Regression } (df_{reg})}$$

$$MSE = \frac{\text{Residual Sum of Squares (RSS)}}{\text{Degrees of Freedom for Error } (df_{error})}$$

iii. Determine the Degrees of Freedom

- $df_{reg} = p$: Number of predictors.
- $df_{error} = n-p-1$: Residual degrees of freedom, where 𝑛 is the number of observations.

iv. Find the Critical Value

- Use the F-distribution table or Python to find the critical value for the given 𝛼 (commonly 0.05), $df_{reg}$ and $df_{error}$

v. Compare F-Statistic with the Critical Value
- If $𝐹 > 𝐹_{𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙}$, reject $𝐻_0$. 
    - This implies that at least one predictor is significant.
- If $𝐹 ≤ 𝐹_{𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙}$, fail to reject $𝐻_0$. 
    - This implies there is insufficient evidence that the predictors collectively explain the variability in the response.

vi. Use the p-Value (Optional)
- Instead of using a critical value, you can calculate the p-value associated with the F-statistic:
    - If p-value < α, reject $𝐻_0$.
    - If p-value ≥ α, fail to reject $𝐻_0$.

Interpreting Results
- Significant F-statistic: Indicates the model has predictive power and at least one predictor is meaningful.
- Non-significant F-statistic: Suggests the model does not explain variability better than a simple mean-based model.

In [None]:
# Example parameters
TSS = 1200  # Total Sum of Squares
RSS = 300   # Residual Sum of Squares
n = 50      # Number of observations
p = 3       # Number of predictors (excluding the intercept)

# Step 1: Degrees of freedom
df_regression = p               # Degrees of freedom for regression
df_error = n - p - 1           # Degrees of freedom for error

# Step 2: Explained variance
explained_variance = TSS - RSS

# Step 3: Calculate MSR and MSE
MSR = explained_variance / df_regression  # Mean Square Regression
MSE = RSS / df_error                     # Mean Square Error

# Step 4: Calculate the F-statistic
F_statistic = MSR / MSE

# Step 5: Perform hypothesis testing
import scipy.stats as stats

# Calculate the critical value for the F-distribution
alpha = 0.05  # Significance level
F_critical = stats.f.ppf(1 - alpha, df_regression, df_error)

# Calculate the p-value for the F-statistic
p_value = 1 - stats.f.cdf(F_statistic, df_regression, df_error)

# Print the results
print(f"Degrees of Freedom (Regression): {df_regression}")
print(f"Degrees of Freedom (Error): {df_error}")
print(f"Explained Variance: {explained_variance}")
print(f"Mean Square Regression (MSR): {MSR}")
print(f"Mean Square Error (MSE): {MSE}")
print(f"F-Statistic: {F_statistic}")
print(f"Critical F-Value: {F_critical}")
print(f"P-Value: {p_value}")

# Decision based on F-statistic
if F_statistic > F_critical:
    print("Reject the null hypothesis: At least one predictor is significant.")
else:
    print("Fail to reject the null hypothesis: The model is not significant.")


3. Assessing Fit:
- Coefficient of Determination $R^2$: 
    - Purpose: Measures the proportion of variance in the dependent variable (𝑌) explained by the independent variables.
    - When to Use: Always, as a baseline measure of model fit.
- Adjusted $R^2$: 
    - Purpose: Adjusts $R^2$ for the number of predictors, penalizing the inclusion of irrelevant predictors.
    - Key Consideration: Because plain $R^2$ never decreases when a predictor is added, adjusted $R^2$ is the better measure for comparing models with different numbers of variables.
- Residual Analysis
    - Purpose: Examines the residuals (differences between observed and predicted values) to check assumptions of the regression model.
    - How to Use:
        - Plot residuals vs. predicted values to check for patterns (should appear random).
        - Use a histogram or Q-Q plot of residuals to check normality.
        - Examine residuals vs. independent variables to check for independence.
    - When to Use: Always, to validate assumptions like linearity, homoscedasticity, and normality.
- Mean Squared Error (MSE)
    - Purpose: Measures the average squared difference between observed and predicted values.
    - When to Use: To quantify model error; lower MSE indicates better fit.
- F-Statistic
    - Purpose: Tests the overall significance of the model by comparing explained variance to unexplained variance.
    - When to Use: To test whether at least one predictor is significant in explaining the variance of the dependent variable.
- Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)
    - Purpose: Compare models, penalizing for model complexity.
    - When to Use: When comparing multiple models with different numbers of predictors or structures.
- Cross-Validation
    - Purpose: Evaluates the model’s performance on unseen data.
    - How to Use:
        - Use k-fold cross-validation to split the data into k folds, training on k-1 folds and testing on the remaining fold in turn.
        - Calculate metrics (e.g., $R^2$, MSE) on each held-out fold.
    - When to Use: To assess model generalizability.
- Variance Inflation Factor (VIF)
    - Purpose: Detects multicollinearity among predictors.
    - How to Use: Compute VIF for each predictor; values > 10 indicate high multicollinearity.
    - When to Use: To assess stability of coefficient estimates.
- Cook’s Distance and Leverage
    - Purpose: Identifies influential observations that disproportionately affect the regression results.
    - How to Use:
        - Cook’s Distance: Observations with values > 1 (or, as a stricter rule of thumb, > 4/n) are considered influential.
        - Leverage: High-leverage points have significant potential to influence the model.
    - When to Use: To identify outliers and influential data points.
- Normalized Residual Standard Error (NRSE)
    - Purpose: Provides a standardized measure of error in the model.
    - When to Use: To compare models with different dependent variable scales.
- Predictive Metrics (e.g., RMSE, MAE)
    - Purpose: Evaluate model accuracy in predicting outcomes.
    - When to Use: For regression models focused on prediction.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.api import OLS, add_constant

# Example dataset
np.random.seed(42)
X = pd.DataFrame({
    'Variable_A': np.random.rand(100) * 100,
    'Variable_B': np.random.rand(100) * 50,
    'Variable_C': np.random.rand(100) * 10
})
y = 2 * X['Variable_A'] + 0.5 * X['Variable_B'] + 0.1 * X['Variable_C'] + np.random.randn(100) * 5

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# 1. Coefficient of Determination (R^2)
r2 = r2_score(y_test, y_pred)
adjusted_r2 = 1 - (1 - r2) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)
print(f"R^2: {r2:.4f}, Adjusted R^2: {adjusted_r2:.4f}")

# 2. Residual Analysis
residuals = y_test - y_pred
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Predicted')

plt.subplot(1, 2, 2)
sns.histplot(residuals, kde=True)
plt.xlabel('Residuals')
plt.title('Distribution of Residuals')
plt.show()

# 3. Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")

# 4. F-Statistic (using statsmodels)
X_train_const = add_constant(X_train)
ols_model = OLS(y_train, X_train_const).fit()
print(ols_model.summary())

# 5. Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)
print(f"AIC: {ols_model.aic:.4f}, BIC: {ols_model.bic:.4f}")

# 6. Cross-Validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"Cross-Validation R^2 Scores: {cv_scores}")
print(f"Mean CV R^2: {np.mean(cv_scores):.4f}")

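# 7. Variance Inflation Factor (VIF)
# VIF needs an intercept column in the design matrix; column 0 (the constant) is skipped
X_vif = add_constant(X)
vif_data = pd.DataFrame()
vif_data['Variable'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(1, X_vif.shape[1])]
print("Variance Inflation Factor (VIF):")
print(vif_data)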

# 8. Cook's Distance and Leverage
influence = ols_model.get_influence()
cooks_d = influence.cooks_distance[0]
high_influence_points = np.where(cooks_d > 4 / len(X_train))[0]
print(f"High Influence Points (Cook's Distance > 4/n): {high_influence_points}")

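# 9. Residual Standard Error (RSE) of the training fit: sqrt(RSS / (n - p - 1))
rse = np.sqrt(ols_model.mse_resid)
# Normalized by the mean of the response (one possible normalization convention)
nrse = rse / y_train.mean()
print(f"Residual Standard Error (RSE): {rse:.4f}, Normalized RSE: {nrse:.4f}")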

# 10. Predictive Metrics (RMSE and MAE)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = np.mean(np.abs(y_test - y_pred))
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}, Mean Absolute Error (MAE): {mae:.4f}")


In [None]:
############## Step 1: Import Libraries and Load Data
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Generate example dataset
np.random.seed(42)
data = {
    "SquareFootage": np.random.uniform(500, 4000, 100),
    "Bedrooms": np.random.randint(1, 6, 100),
    "DistanceFromCityCenter": np.random.uniform(1, 20, 100),
    "HousePrice": np.random.uniform(50000, 500000, 100),
}

df = pd.DataFrame(data)

# Print sample data
print(df.head())

############# Step 2: Fit the Multilinear Regression Model

# Define predictors (X) and response (Y)
X = df[["SquareFootage", "Bedrooms", "DistanceFromCityCenter"]]
Y = df["HousePrice"]

# Add a constant for the intercept
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(Y, X).fit()

# Summary of the model
print(model.summary())

############## Step 3: Interpret the Output

# summary() lists the fitted parameters and diagnostics of the regression.
# The same call works for the models fitted in the earlier cells:
print(lr.summary())

# OR

print(fitted.summary())

##### Interpretation of `summary`

The model.summary() provides:

Coefficients:
- $𝛽_{0}$: The intercept.
- $𝛽_{1}$,$𝛽_{2}$,$𝛽_{3}$: Coefficients for predictors.

P-values:
- Assess the significance of each predictor.
  - If 𝑝 < 0.05, the predictor significantly explains variations in the response variable.

$R^{2}$ and Adjusted $R^{2}$ :
- Measure how much variance in the response is explained by the predictors.

**F-statistic**:
- Tests the overall significance of the model.
- It tests whether at least one predictor variable in the model has a non-zero coefficient, meaning it contributes significantly to explaining the variance in the dependent variable.


4. Assumptions:
- Linearity: The relationship between predictors and response is linear.
- Independence of Errors: Errors are independent of each other.
- Homoscedasticity: Constant variance of errors.
- Normality of Errors: Errors are normally distributed.

##### Practical Steps:
1. Plot residuals to check assumptions.
2. Use statistical tests (e.g., Shapiro-Wilk for normality, Breusch-Pagan for homoscedasticity).
3. Apply transformations or alternative models if assumptions are violated.

In [None]:
############### Step 4: Visualize Residuals to Check Assumptions

# Linearity and Homoscedasticity
# Plot residuals vs predicted values
predicted = model.predict(X)
residuals = Y - predicted

plt.scatter(predicted, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs Predicted Values")
plt.show()

# Normality of Errors
# Plot residual distribution
sns.histplot(residuals, kde=True)
plt.title("Residual Distribution")
plt.show()

# Perform Shapiro-Wilk test for normality
from scipy.stats import shapiro
shapiro_test = shapiro(residuals)
print(f"Shapiro-Wilk test p-value: {shapiro_test.pvalue}")


Step 5: Interpretation

Coefficient Interpretation
- If the coefficient of `SquareFootage`, $𝛽_{1}$, is 50, it means that for every additional square foot, the predicted house price increases by \$50, assuming the other predictors are held constant.

Model Fit
- if $R^{2}$ = 0.85, it means 85% of the variance in house prices is explained by the predictors.
- Check adjusted $R^{2}$ to ensure added predictors improve the model meaningfully.

Assumptions
- Linearity: A residual plot with no visible pattern is consistent with linearity.
- Homoscedasticity: Residuals should have constant variance (scatter evenly around zero).
- Normality: Residuals should approximately follow a normal distribution.
___________

Below are a few 2-dimensional plots of `wt`, `disp`, `cyl`, and `hp` vs. `mpg`, respectively (top-left to bottom-right).

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(9,7))

axs[0,0].scatter(df['wt'], df['mpg'])
axs[0,0].plot(df['wt'], lm.intercept_ + lm.coef_[4]*df['wt'], color='red')
axs[0,0].title.set_text('Weight (wt) vs. mpg')

axs[0,1].scatter(df['disp'], df['mpg'])
axs[0,1].plot(df['disp'], lm.intercept_ + lm.coef_[1]*df['disp'], color='red')
axs[0,1].title.set_text('Engine displacement (disp) vs. mpg')

axs[1,0].scatter(df['cyl'], df['mpg'])
axs[1,0].plot(df['cyl'], lm.intercept_ + lm.coef_[0]*df['cyl'], color='red')
axs[1,0].title.set_text('Number of cylinders (cyl) vs. mpg')

axs[1,1].scatter(df['hp'], df['mpg'])
axs[1,1].plot(df['hp'], lm.intercept_ + lm.coef_[2]*df['hp'], color='red')
axs[1,1].title.set_text('Horsepower (hp) vs. mpg')

fig.tight_layout(pad=3.0)

plt.show()

### Assessing Model Accuracy

Let's assess the fit of our multivariate model. For the purpose of a rudimentary comparison, let's measure model accuracy against a simple linear regression model.

In [None]:
# Add a constant to X_test
X_test_sm = sm.add_constant(X_test)
# Predict the y values corresponding to X_test_sm
y_pred = lr.predict(X_test_sm)

We have included a column *Test RMSE*, which is simply the square root of the *Test MSE*.


\begin{align}
RMSE & = \sqrt{MSE} \\
     & = \sqrt{\frac{1}{N}\sum^{N} (\hat{y_i} - y_i)^{2}}
\end{align}

Where $y_i$ are the actual target values for a dataset with $N$ datapoints, and $\hat{y_i}$ represent our corresponding predictions. RMSE is a more intuitive metric to use than MSE because it is in the same units as the underlying variable being predicted.
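
The formula can be verified on a few numbers. The values below are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# MSE: mean of the squared prediction errors
# squared errors are 0.25, 0.25, 0.0 and 1.0, so MSE = 1.5 / 4 = 0.375
mse = np.mean((y_pred - y_true) ** 2)
# RMSE: square root of the MSE, in the same units as y
rmse = np.sqrt(mse)

print(f"MSE:  {mse:.4f}")   # MSE:  0.3750
print(f"RMSE: {rmse:.4f}")  # RMSE: 0.6124
```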

In [None]:
from sklearn import metrics
import math

# Importing libraries
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# dictionary of results
results_dict = {'Training MSE':
                    {
                        "SLR": metrics.mean_squared_error(y_train, slr.predict(X_train[['disp']])),
                        "MLR": metrics.mean_squared_error(y_train, lm.predict(X_train))
                    },
                'Test MSE':
                    {
                        "SLR": metrics.mean_squared_error(y_test, slr.predict(X_test[['disp']])),
                        "MLR": metrics.mean_squared_error(y_test, lm.predict(X_test))
                    },
                'Test RMSE':
                    {
                        "SLR": math.sqrt(metrics.mean_squared_error(y_test, slr.predict(X_test[['disp']]))),
                        "MLR": math.sqrt(metrics.mean_squared_error(y_test, lm.predict(X_test)))
                    }
                }

# Display the results as a table (rows: models, columns: metrics)
results_df = pd.DataFrame(data=results_dict)
results_df

In [None]:
#RMSE value
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
#R-squared value
print("R-squared: ",r2_score(y_test, y_pred))

In [None]:
X_train_lm = X_train_lm.values.reshape(-1,1)
X_test_lm = X_test_lm.values.reshape(-1,1)

In [None]:
print(X_train_lm.shape)
print(X_test_lm.shape)

In [None]:
from sklearn.linear_model import LinearRegression
#Representing LinearRegression as lr (creating LinearRegression object)
lr = LinearRegression()
#Fit the model using lr.fit()
lr.fit(X_train_lm,y_train_lm)

In [None]:
#get intercept
print(lr.intercept_)
#get slope
print(lr.coef_)

# Addressing Assumptions in Multilinear Regression

### **When using Linear Regression, several potential problems can arise that violate its assumptions**
These problems can lead to biased or inefficient estimates. The key issues, along with how to detect and address them, are:
- non-linearity, 
- correlation of error terms, 
- non-constant variance of error terms, 
- outliers, 
- high-leverage points, and 
- collinearity.

Initial Diagnostics:
- Examine scatter plots and residual plots.
- Test assumptions (e.g., Breusch-Pagan for heteroscedasticity, Shapiro-Wilk for normality).

Transform Data if Necessary:
- Use log, Box-Cox, or polynomial transformations to address issues like non-linearity and heteroscedasticity.

Refit and Compare Models:
- Use metrics like Adjusted $𝑅^2$ , Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC) to compare models.

Document Interpretations:
- Explain coefficients in the context of transformed variables.
- Discuss any trade-offs made during model selection.
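
The "transform, refit and compare" workflow above can be sketched as follows. The data are simulated with a curved relationship (an assumption for illustration), and `fit_and_score` is a hypothetical helper that computes adjusted $R^2$ and a Gaussian AIC up to an additive constant, which is enough for comparing models fitted to the same response.

```python
import numpy as np

def fit_and_score(X, y):
    """Fit OLS by least squares; return adjusted R^2 and a Gaussian AIC (up to a constant)."""
    n, k = X.shape                      # k counts the intercept column, so n - k = n - p - 1
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    adj_r2 = 1 - (rss / (n - k)) / (tss / (n - 1))
    aic = n * np.log(rss / n) + 2 * k
    return adj_r2, aic

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=2.0, size=n)  # curved relationship

X_linear = np.column_stack([np.ones(n), x])
X_quadratic = np.column_stack([np.ones(n), x, x**2])   # polynomial transformation

scores = {
    "linear": fit_and_score(X_linear, y),
    "quadratic": fit_and_score(X_quadratic, y),
}
for name, (adj_r2, aic) in scores.items():
    print(f"{name}: adjusted R^2={adj_r2:.4f}, AIC={aic:.1f}")
```

A higher adjusted $R^2$ and a lower AIC both favour the model with the squared term here. Note that AIC values are only comparable between models fitted to the same response variable, so a log-transformed response cannot be compared this way against the untransformed one.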

# Checking for Independence

Independence of Errors: Errors should be independent (important for time series or clustered data).

We have done checks for linearity and multicollinearity, both of which concern the predictor variables. 

We now check some of the artefacts of the fitted model for three more statistical phenomena, which further help us determine its quality.

#### Residuals vs. Predictor Variables Plots 

The first check we do involves plotting the residuals (vertical distances between each data point and the regression hyperplane). 
- We are looking to confirm the independence assumption here, i.e.: the residuals should be independent. 

If they are, we will see:
- Residuals approximately uniformly randomly distributed about the zero line (the horizontal axis);
- Residuals not forming specific clusters.

When observing the plots, check two things:

- Skew: whether the residuals are slightly skewed to the positive or negative (for example, reaching +5 but only about -3);

- Clustering: whether any predictor presents a cluster of residuals (for example, on the value 6).

Conclusion: decide whether the residuals are largely independent.

In [None]:
fig, axs = plt.subplots(2,5, figsize=(14,6),sharey=True)
fig.subplots_adjust(hspace = 0.5, wspace=.2)
fig.suptitle('Predictor variables vs. model residuals', fontsize=16)
axs = axs.ravel()

# df.columns[0] is the response, so plot residuals against the predictors only
for index, column in enumerate(df.columns[1:]):
    axs[index].set_title("{}".format(column),fontsize=12)
    axs[index].scatter(x=df[column],y=fitted.resid,color='blue',edgecolor='k')
    axs[index].grid(True)
    xmin = min(df[column])
    xmax = max(df[column])
    axs[index].hlines(y=0,xmin=xmin*0.9,xmax=xmax*1.1,color='red',linestyle='--',lw=3)
    # Label the y-axis on the first plot of each row
    if index == 0 or index == 5:
        axs[index].set_ylabel('Residuals')

# Correlation of Error Terms (Autocorrelation)
Issue:
- Errors (residuals) in a regression model should be independent. 
    - If errors are correlated (autocorrelation), the model violates the assumption of independent errors, leading to underestimated standard errors and inflated R² values.

Detection:
- Durbin-Watson Test: Values close to 2 suggest no autocorrelation, while values near 0 indicate positive autocorrelation and values near 4 indicate negative autocorrelation.
- Residual Plots: Plot residuals over time—patterns or trends suggest autocorrelation.
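
To show where the "close to 2" rule comes from, the Durbin-Watson statistic can be computed directly as the sum of squared successive residual differences divided by the sum of squared residuals, which is roughly 2(1 − r) for lag-1 residual correlation r. This is a minimal NumPy sketch on simulated data; the helper names and the AR(1) coefficient of 0.8 are illustrative assumptions.

```python
import numpy as np

def durbin_watson_stat(resid):
    """Sum of squared successive differences divided by the sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

def ols_residuals(x, y):
    """Residuals from a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (intercept + slope * x)

rng = np.random.default_rng(1)
n = 500
x = np.linspace(1, 100, n)

# Independent errors versus AR(1) errors with rho = 0.8 (illustrative value)
e_iid = rng.normal(0, 10, n)
shocks = rng.normal(0, 10, n)
e_ar = np.zeros(n)
for t in range(1, n):
    e_ar[t] = 0.8 * e_ar[t - 1] + shocks[t]

dw_iid = durbin_watson_stat(ols_residuals(x, 2.5 * x + e_iid))
dw_ar = durbin_watson_stat(ols_residuals(x, 2.5 * x + e_ar))
print(f"DW, independent errors: {dw_iid:.2f}")
print(f"DW, AR(1) errors:       {dw_ar:.2f}")
```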

Solutions:
- Use lag variables to capture dependencies in time series data.
- Apply Generalized Least Squares (GLS) or use a Newey-West estimator to correct for autocorrelation.
- Consider ARIMA models for time-dependent data.

### Effect of Correlated Errors on Standard Errors and Confidence Intervals
When the error terms in a regression model are correlated, the assumptions of ordinary least squares (OLS) regression are violated. 
- This can lead to biased standard errors and misleading confidence intervals, which in turn affect hypothesis testing and model interpretation.

#### 1. How Correlated Errors Affect Standard Errors
- Standard errors estimate the variability of the regression coefficients.
- When errors are correlated (also known as autocorrelation in time series or spatial correlation in spatial data), standard errors are underestimated or overestimated.
- Underestimated standard errors: The model falsely appears more precise than it actually is, increasing the likelihood of Type I errors (incorrectly rejecting the null hypothesis).
- Overestimated standard errors: The model appears less precise, making it harder to detect significant relationships.

Key Impact: The formula for standard errors assumes independence. 
- If errors are correlated, the estimated variance is incorrect, leading to misleading inferences.

#### 2. How Correlated Errors Affect Confidence Intervals
- Confidence intervals depend on standard errors:
$$ CI = \hat{\beta} \pm t_{\frac{\alpha}{2}} \cdot SE(\hat{\beta}) $$
- When standard errors are underestimated, confidence intervals are too narrow, making the model falsely appear more certain.
- When standard errors are overestimated, confidence intervals are too wide, reducing the power to detect real effects.

Key Impact: The true confidence level differs from what is reported. 
- A 95% confidence interval might not actually contain the true parameter 95% of the time.
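
The gap between nominal and actual coverage can be checked by simulation: generate many data sets, build the usual 95% interval for the slope in each, and count how often it contains the true value. The sketch below assumes a trending predictor and AR(1) errors with ρ = 0.8; the sample size, number of replications and the hard-coded t quantile are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, true_slope = 100, 1000, 2.5
x = np.linspace(1, 100, n)
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
T_CRIT = 1.984            # approximate two-sided 95% t quantile for 98 d.f.

def coverage(errors):
    """Share of simulated data sets whose nominal 95% CI contains the true slope."""
    Y = true_slope * x + errors                 # one data set per row, shape (reps, n)
    beta = Y @ X @ XtX_inv                      # OLS coefficients, shape (reps, 2)
    resid = Y - beta @ X.T
    s2 = (resid ** 2).sum(axis=1) / (n - 2)     # residual variance per data set
    se_slope = np.sqrt(s2 * XtX_inv[1, 1])      # textbook OLS standard error
    return float(np.mean(np.abs(beta[:, 1] - true_slope) <= T_CRIT * se_slope))

e_iid = rng.normal(0, 10, size=(reps, n))
shocks = rng.normal(0, 10, size=(reps, n))
e_ar = np.zeros((reps, n))
for t in range(1, n):                           # AR(1) errors with rho = 0.8
    e_ar[:, t] = 0.8 * e_ar[:, t - 1] + shocks[:, t]

cov_iid = coverage(e_iid)
cov_ar = coverage(e_ar)
print(f"Coverage with independent errors: {cov_iid:.3f}")
print(f"Coverage with AR(1) errors:       {cov_ar:.3f}")
```

With independent errors the coverage should sit near 0.95; with positively autocorrelated errors the same formula produces intervals that miss the true slope far more often than 5% of the time.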

Visualizing Correlated Errors in Python
We will:
- Simulate a dataset with uncorrelated errors (ideal case).
- Introduce correlated errors (violating OLS assumptions).
- Compare standard errors and confidence intervals.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Set seed for reproducibility
np.random.seed(42)

# Generate predictor variable
X = np.linspace(1, 100, 100)

# Generate **uncorrelated errors** (white noise)
errors_uncorrelated = np.random.normal(0, 10, 100)
y_uncorrelated = 2.5 * X + errors_uncorrelated

# Generate **correlated errors** (introducing autocorrelation)
errors_correlated = np.zeros(100)
rho = 0.8  # Degree of correlation

for t in range(1, 100):
    errors_correlated[t] = rho * errors_correlated[t-1] + np.random.normal(0, 10)

y_correlated = 2.5 * X + errors_correlated

# Fit OLS regression models
X_const = sm.add_constant(X)
model_uncorrelated = sm.OLS(y_uncorrelated, X_const).fit()
model_correlated = sm.OLS(y_correlated, X_const).fit()

# Compare standard errors of coefficients
print("Standard Error of Coefficients:")
print("Uncorrelated Errors:", model_uncorrelated.bse)
print("Correlated Errors:", model_correlated.bse)

# Compare confidence intervals
print("\nConfidence Intervals:")
print("Uncorrelated Errors:\n", model_uncorrelated.conf_int())
print("Correlated Errors:\n", model_correlated.conf_int())

# Plot residuals to visualize correlation
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(errors_uncorrelated, label="Uncorrelated Errors", color="blue")
plt.title("Uncorrelated Errors")
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(errors_correlated, label="Correlated Errors", color="red")
plt.title("Correlated Errors")
plt.legend()
plt.show()

Interpretation of Results

Standard Errors Comparison
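- In this simulation the reported standard error is typically larger for the correlated case, because AR(1) errors with ρ = 0.8 have a larger overall variance than the white-noise errors (σ²/(1 − ρ²) rather than σ²).
- Even so, the OLS formula still assumes independent errors. With positive autocorrelation the reported standard error tends to understate the true sampling variability (false precision); in other settings, such as negative autocorrelation, it can overstate it (inflated uncertainty).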

Confidence Interval Comparison
- With correlated errors, the confidence interval is incorrect (either too narrow or too wide).
- The interval might suggest false confidence in parameter estimates, leading to incorrect conclusions.

Solutions for Correlated Errors
1. Check for correlation using Durbin-Watson test:
    - DW ≈ 2 → No correlation.
    - DW < 1 or > 3 → Strong correlation (positive below 1, negative above 3).

In [None]:
from statsmodels.stats.stattools import durbin_watson

dw_stat = durbin_watson(model_correlated.resid)
print(f"Durbin-Watson Statistic: {dw_stat}")

2. Use Generalized Least Squares (GLS): Handles correlated errors better than OLS.

3. Use HAC (Heteroskedasticity and Autocorrelation Consistent) Standard Errors:

In [None]:
model_hac = model_correlated.get_robustcov_results(cov_type="HAC", maxlags=1)
print(model_hac.summary())

# Checking for Homoscedasticity

Homoscedasticity is an important assumption in linear regression. It implies that the variance of the residuals (errors) is constant across all levels of the independent variables. When this assumption is violated (heteroscedasticity), the model's standard errors and p-values can become unreliable, potentially leading to incorrect inferences.

Homoscedasticity: Residuals should have constant variance.

What is Heteroscedasticity?
- Heteroscedasticity occurs when the variance of the error terms (residuals) is not constant across all levels of an independent variable in a regression model. 
- This violates a key assumption of Ordinary Least Squares (OLS) regression, which assumes that residuals have constant variance (homoscedasticity).

Why is Heteroscedasticity a Problem?
- Biased Standard Errors → Leads to incorrect hypothesis testing.
- Unreliable Confidence Intervals → False statistical inferences.
- Inefficient Estimates → OLS no longer provides the best linear unbiased estimators (BLUE).

What needs to be done: Check whether the variance of the residuals (the error terms) is constant as the fitted values increase. 

#### Nonconstant Variance of Error Terms (Homoscedasticity vs. Heteroscedasticity)
Issue:
- Linear regression assumes homoscedasticity—that the variance of errors remains constant across all levels of the independent variables. 
- If error variance changes, the model becomes inefficient, affecting confidence intervals and hypothesis tests.

Detection:
- Residual vs. Fitted Value Plot: A funnel shape suggests heteroscedasticity.
- Breusch-Pagan or White’s Test: Formal statistical tests for nonconstant variance.

Solutions:
- Use log or square root transformations on the dependent variable.
- Apply Weighted Least Squares (WLS) regression.
- Use robust standard errors to adjust for heteroscedasticity.

#### Fitted vs. Residuals

Determine this by plotting the fitted values (i.e. the predicted `mpg`) against the residuals.
- What we are looking for is the plotted points to approximately form a rectangle.
- The magnitude of the residuals should not increase as the fitted values increase (if that is the case, the data will form the shape of a cone on its side).

**Observation**
- If the variance is constant, we have observed _homoscedasticity_. 
- If the variance is not constant, we have observed _heteroscedasticity_. 

Use the same plot to check for outliers: any plotted points that are visibly separate from the random pattern of the rest of the residuals.

**Observation**
- Look at the data points on each side of the plot and compare how densely they are scattered.
    - Points towards the right-hand side of the plot tend to be scattered slightly less densely, indicating the presence of heteroscedasticity.
    - This violates our assumption of homoscedasticity.
- Look for the presence of outliers.
    - The presence of outliers means that those values are weighted too heavily in the fitting process, disproportionately influencing the model's performance.
    - This in turn can lead to the confidence interval for out-of-sample predictions (unseen data) being unrealistically wide or narrow.

If heteroscedasticity is present:
- Solution: Use transformations (log, Box-Cox) or weighted least squares regression.

**Step 1: Diagnosing Heteroscedasticity**

Detecting Heteroscedasticity
- Before applying transformations, confirm the presence of heteroscedasticity:
    - Residual plots: Plot residuals vs. fitted values.
    - Breusch-Pagan test: A statistical test for heteroscedasticity.
    - White test: A general test for heteroscedasticity.

(a) Residual Plot
- Plot the residuals against the predicted values to check for patterns.

Interpretation:
- If the points are randomly scattered, homoscedasticity is likely satisfied.
- A funnel-shaped or other pattern suggests heteroscedasticity.

In [None]:
plt.figure(figsize=(8,3))
p=plt.scatter(x=fitted.fittedvalues,y=fitted.resid,edgecolor='k')
xmin = min(fitted.fittedvalues)
xmax = max(fitted.fittedvalues)
plt.hlines(y=0,xmin=xmin*0.9,xmax=xmax*1.1,color='red',linestyle='--',lw=3)
plt.xlabel("Fitted values",fontsize=15)
plt.ylabel("Residuals",fontsize=15)
plt.title("Fitted vs. residuals plot",fontsize=18)
plt.grid(True)
plt.show()

In [None]:
import matplotlib.pyplot as plt

# Predicted values and residuals
predicted = model.predict(X)
residuals = Y - predicted

# Residual plot
plt.scatter(predicted, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.title("Residual Plot")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()

(b) Breusch-Pagan Test
- This statistical test explicitly checks for heteroscedasticity.

Interpretation:
- Null Hypothesis: Homoscedasticity is present.
- If p-value < 0.05, reject the null hypothesis, indicating heteroscedasticity.

In [None]:
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test
bp_test = het_breuschpagan(residuals, X)
print("Breusch-Pagan Test Results:")
print(f"LM Statistic: {bp_test[0]}")
print(f"p-value: {bp_test[1]}")
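
To make the mechanics of the test visible, the studentized (Koenker) form of the Breusch-Pagan statistic can be computed by hand: regress the squared residuals on the predictors and take LM = n·R². This is a NumPy sketch on simulated data, not the statsmodels implementation; the helper name and data are assumptions, and the closed-form p-value used here holds only for a single non-constant predictor (one degree of freedom).

```python
import math
import numpy as np

def breusch_pagan_lm(resid, X):
    """Studentized Breusch-Pagan statistic: n * R^2 of resid^2 regressed on X.

    X must already contain a constant column.
    """
    u2 = np.asarray(resid, dtype=float) ** 2
    gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
    fitted = X @ gamma
    r2 = 1.0 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return float(len(u2) * r2)

rng = np.random.default_rng(3)
n = 200
x = np.linspace(1, 100, n)
X = np.column_stack([np.ones(n), x])
y = 5 * x + rng.normal(0, x)          # error s.d. grows with x

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

lm = breusch_pagan_lm(resid, X)
# With one non-constant regressor, LM is chi-square(1) under the null,
# and P(chi2_1 > lm) = erfc(sqrt(lm / 2)).
p_value = math.erfc(math.sqrt(lm / 2))
print(f"LM statistic: {lm:.2f}, p-value: {p_value:.4g}")
```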


(c) White Test
- Another test for heteroscedasticity, more general than Breusch-Pagan because its auxiliary regression also includes the squares and cross-products of the predictors.

In [None]:
from statsmodels.stats.diagnostic import het_white

# White test
white_test = het_white(residuals, X)
print("White Test Results:")
print(f"LM Statistic: {white_test[0]}")
print(f"p-value: {white_test[1]}")


**Step 2: Addressing Heteroscedasticity/ Handling Homoscedasticity**

1. Mitigating Heteroscedasticity Using Data Transformation: Transforming the Response Variable
- Applying transformations to the dependent (Y) or independent (X) variables can stabilize variance and improve model accuracy.

Common Data Transformations for Heteroscedasticity:

| Transformation | When to Use | Formula |
|----------------|-------------|---------|
| Log Transformation | When variance increases with larger values | $Y^* = \log(Y)$ |
| Square Root Transformation | When variance grows moderately | $Y^* = \sqrt{Y}$ |
| Box-Cox Transformation | Generalized approach for different degrees of heteroscedasticity | $Y^* = \frac{Y^{\lambda} - 1}{\lambda}$ for $\lambda \neq 0$, and $\log(Y)$ for $\lambda = 0$ |
| Inverse Transformation | When larger values have very high variance | $Y^* = \frac{1}{Y}$ |

(a) Log Transformation: Use when variance increases with the response variable.

In [None]:
df["Log_HousePrice"] = np.log(df["HousePrice"])
model_log = sm.OLS(df["Log_HousePrice"], X).fit()
print(model_log.summary())

In [None]:
Y_log = np.log(Y)

# Fit new model
model_log = sm.OLS(Y_log, X_const).fit()

# Plot residuals
plt.scatter(model_log.fittedvalues, model_log.resid, alpha=0.7)
plt.axhline(y=0, color='red', linestyle='dashed')
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Residual Plot (Log Transformation)")
plt.show()
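
A numeric check can complement the residual plot. The sketch below simulates a response with multiplicative noise (so its spread grows with the mean) and compares the residual standard deviation in the upper half of the predictor range with that in the lower half, before and after taking logs. The data, the helper `spread_ratio` and the choice to log the predictor as well (so the relationship stays linear) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
x = np.linspace(1, 100, n)
y = 5 * x * np.exp(rng.normal(0, 0.3, n))   # multiplicative noise: spread grows with x

def spread_ratio(x_fit, y_fit):
    """Residual s.d. in the upper half of x divided by that in the lower half."""
    slope, intercept = np.polyfit(x_fit, y_fit, 1)
    resid = y_fit - (intercept + slope * x_fit)
    half = len(resid) // 2                  # x is sorted, so halves are low/high x
    return float(resid[half:].std() / resid[:half].std())

ratio_raw = spread_ratio(x, y)                    # model on the original scale
ratio_log = spread_ratio(np.log(x), np.log(y))    # log-log model
print(f"Spread ratio, raw scale: {ratio_raw:.2f}")
print(f"Spread ratio, log scale: {ratio_log:.2f}")
```

A ratio well above 1 signals a funnel shape; a ratio near 1 after the transformation suggests the variance has been stabilised.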


(b) Box-Cox Transformation: Estimates the power parameter λ from the data (by maximum likelihood) instead of fixing the transformation in advance.

In [None]:
from scipy.stats import boxcox

Y_transformed, lambda_boxcox = boxcox(Y)
print(f"Optimal lambda for Box-Cox: {lambda_boxcox}")

model_boxcox = sm.OLS(Y_transformed, X).fit()
print(model_boxcox.summary())


In [None]:
# Box-Cox transformation (requires strictly positive Y)
Y_bc, lambda_bc = boxcox(Y - np.min(Y) + 1)  # Shift Y to be positive

# Fit new model
model_bc = sm.OLS(Y_bc, X_const).fit()

# Plot residuals
plt.scatter(model_bc.fittedvalues, model_bc.resid, alpha=0.7)
plt.axhline(y=0, color='red', linestyle='dashed')
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title(f"Residual Plot (Box-Cox Transformation, λ={lambda_bc:.2f})")
plt.show()
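
To illustrate how λ is chosen, the Box-Cox transform and its profile log-likelihood can be written out and maximised over a grid. This is a hand-rolled sketch of the idea behind maximum-likelihood selection of λ, not the SciPy implementation; the function names, the grid and the log-normal test data are assumptions.

```python
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform; reduces to log(y) when lam is 0. Requires y > 0."""
    y = np.asarray(y, dtype=float)
    if abs(lam) < 1e-12:
        return np.log(y)
    return (y ** lam - 1.0) / lam

def box_cox_loglik(y, lam):
    """Profile log-likelihood of lam for a normal model on the transformed data."""
    z = box_cox(y, lam)
    n = len(z)
    return float(-0.5 * n * np.log(z.var()) + (lam - 1.0) * np.log(y).sum())

rng = np.random.default_rng(5)
y = np.exp(rng.normal(1.0, 1.0, 500))     # log-normal sample, so lambda near 0 is expected

grid = np.linspace(-2, 2, 401)            # candidate lambdas in steps of 0.01
loglik = np.array([box_cox_loglik(y, lam) for lam in grid])
lam_hat = float(grid[np.argmax(loglik)])
print(f"Grid-search estimate of lambda: {lam_hat:.2f}")
```

An estimate close to 0 corresponds to the log transformation, close to 0.5 to the square root, and close to 1 to leaving the response untransformed.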

(c) Square Root Transformation

In [None]:
Y_sqrt = np.sqrt(Y)

# Fit new model
model_sqrt = sm.OLS(Y_sqrt, X_const).fit()

# Plot residuals
plt.scatter(model_sqrt.fittedvalues, model_sqrt.resid, alpha=0.7)
plt.axhline(y=0, color='red', linestyle='dashed')
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Residual Plot (Square Root Transformation)")
plt.show()

Implementing Transformations in Python

In summary, we will:
- Generate heteroscedastic data.
- Apply log, square root, and Box-Cox transformations.
- Compare residual plots.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import boxcox
from statsmodels.stats.diagnostic import het_breuschpagan

# Set seed for reproducibility
np.random.seed(42)

# Generate independent variable
X = np.linspace(1, 100, 100)

# Generate heteroscedastic errors (variance increases with X)
errors = np.random.normal(0, X, 100)  

# Create heteroscedastic response variable
Y = 5 * X + errors  

# Fit OLS model
X_const = sm.add_constant(X)
model = sm.OLS(Y, X_const).fit()

# Check for heteroscedasticity (Breusch-Pagan test)
bp_test = het_breuschpagan(model.resid, X_const)
print("Breusch-Pagan Test p-value:", bp_test[1])

# Plot original residuals
plt.figure(figsize=(10, 5))
plt.scatter(model.fittedvalues, model.resid, alpha=0.7)
plt.axhline(y=0, color='red', linestyle='dashed')
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Residual Plot (Before Transformation)")
plt.show()

2. Applying Weighted Least Squares (WLS)

If heteroscedasticity is detected:
- Use WLS to assign weights inversely proportional to the variance of residuals.

When to use:
- When residual patterns vary predictably with certain predictors.

In [None]:
import statsmodels.api as sm
import numpy as np

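# Calculate weights as the inverse of the squared residuals.
# This is a crude proxy for the error variance: residuals near zero receive
# very large weights, so modelling the variance as a function of the
# predictors is often preferred.
weights = 1 / (residuals**2)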

# Fit WLS model
model_wls = sm.WLS(Y, X, weights=weights).fit()
print(model_wls.summary())
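
As an alternative to residual-based weights, the sketch below assumes the variance structure is known (error standard deviation proportional to x) and shows that WLS is simply OLS after each row is scaled by the square root of its weight. The data and the assumed variance function are illustrative; in practice the variance function usually has to be estimated.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = np.linspace(1, 100, n)
X = np.column_stack([np.ones(n), x])
y = 5 * x + rng.normal(0, 0.5 * x)        # error s.d. proportional to x (assumed known)

# Weights are the inverse of the assumed error variance, up to a constant
w = 1.0 / x**2

# WLS equals OLS after scaling each row by sqrt(weight)
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("OLS intercept, slope:", np.round(beta_ols, 3))
print("WLS intercept, slope:", np.round(beta_wls, 3))
```

Both estimators are unbiased here; the benefit of WLS is that noisy high-variance observations count for less, which makes the estimates more precise.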


3. Heteroscedasticity-Robust Standard Errors
- Use robust standard errors to correct inference without changing the model structure.

Types of Robust Covariance:
- "HC0": White's original heteroscedasticity-consistent estimator.
- "HC1", "HC2", "HC3": Small-sample corrections of HC0, with "HC3" usually the most conservative.

In [None]:
# Fit OLS model with robust standard errors
model_robust = sm.OLS(Y, X).fit(cov_type="HC3")
print(model_robust.summary())
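
The robust covariance behind these options is a "sandwich" formula. The following NumPy sketch computes classical, HC0 and HC3 standard errors by hand on simulated heteroscedastic data, to show what changes relative to the textbook formula; the data and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
x = np.linspace(1, 100, n)
X = np.column_stack([np.ones(n), x])
y = 5 * x + rng.normal(0, x**2 / 100)     # strongly heteroscedastic errors

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

# Classical covariance s^2 (X'X)^-1, valid only under constant error variance
s2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(s2 * XtX_inv))

# HC0 sandwich: (X'X)^-1 X' diag(e_i^2) X (X'X)^-1
meat0 = X.T @ (X * resid[:, None] ** 2)
se_hc0 = np.sqrt(np.diag(XtX_inv @ meat0 @ XtX_inv))

# HC3 inflates each squared residual by 1 / (1 - h_ii)^2, h_ii being the leverage
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
meat3 = X.T @ (X * (resid / (1 - h))[:, None] ** 2)
se_hc3 = np.sqrt(np.diag(XtX_inv @ meat3 @ XtX_inv))

print("Classical SE (intercept, slope):", np.round(se_classical, 4))
print("HC0 SE       (intercept, slope):", np.round(se_hc0, 4))
print("HC3 SE       (intercept, slope):", np.round(se_hc3, 4))
```

The coefficient estimates are identical in all three cases; only the standard errors, and therefore the p-values and confidence intervals, change.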

Check Residual Plots After Mitigation

In [None]:
# Residuals of WLS model
predicted_wls = model_wls.predict(X)
residuals_wls = Y - predicted_wls

plt.scatter(predicted_wls, residuals_wls)
plt.axhline(0, color="red", linestyle="--")
plt.title("Residual Plot After WLS")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()

**Step 3: Comparing Models**

Evaluate and Compare Performance
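- Residual plots before and after adjustments.
- Metrics like $R^2$, AIC, and BIC. These are only directly comparable between models fitted to the same response: once Y has been transformed (log, Box-Cox), the values are on a different scale from those of the original model.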

In [None]:
# Compare AIC and BIC
print(f"Original Model AIC: {model.aic}")
print(f"Log-Transformed Model AIC: {model_log.aic}")
print(f"Box-Cox Model AIC: {model_boxcox.aic}")
print(f"WLS Model AIC: {model_wls.aic}")

Best Practices and Considerations

Diagnosis:
- Always check residual plots and use tests like Breusch-Pagan or White.

Correction:
- Start with transformations if patterns suggest non-linearity or skewed responses.
- Use WLS or robust standard errors for complex variance structures.

Validation:
- Ensure improvements in residual plots and metrics.
- Balance interpretability and complexity when applying advanced techniques.

# Checking for Normality

The normality of residuals is a key assumption in linear regression, especially for inference. It ensures that t-tests and F-tests for significance are valid. If residuals are not normally distributed, it can lead to unreliable p-values and confidence intervals.

We therefore need to confirm our assumption of normality amongst the residuals.
- If the residuals are non-normally distributed, confidence intervals can become too wide or too narrow,
    - which makes it difficult to judge how precisely the coefficients have been estimated by ordinary least squares, particularly in small samples.

Check for violation of the normality assumption in two different ways:
1. Plotting a histogram of the normalised residuals;
2. Generating a Q-Q plot of the residuals.

**Step 1: Testing for Normality**

(a) Visual Inspection: Histogram and Q-Q Plot

1. Histogram: Examine the residuals' distribution.

Plot a histogram of the residuals to take a look at their distribution. 
- It is fairly easy to pick up when a distribution looks similar to the classic _bell curve_ shape of the normal distribution.

Interpretation:
- Histogram: A bell-shaped curve suggests normality.

In [None]:
plt.figure(figsize=(8,5))
plt.hist(fitted.resid_pearson,bins=8,edgecolor='k')
plt.ylabel('Count',fontsize=15)
plt.xlabel('Normalized residuals',fontsize=15)
plt.title("Histogram of normalized residuals",fontsize=18)
plt.show()

In [None]:
import matplotlib.pyplot as plt
import scipy.stats as stats

# Calculate residuals
residuals = Y - model.predict(X)

# Histogram
plt.hist(residuals, bins=20, edgecolor='k', alpha=0.7)
plt.title("Residual Histogram")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()

2. Q-Q plot of the residuals

Compare residuals to a normal distribution.
- A Q-Q plot, also known as a quantile-quantile plot, plots the theoretical quantiles of the standard normal distribution against the quantiles of the residuals.
- The one-to-one line, indicated in red below, is the ideal line indicating normality.
- The closer the plotted points are to the red line, the closer the residual distribution is to the standard normal distribution.

Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way.
- The 2-quantile is known as the median
- The 4-quantiles are known as quartiles
- The 10-quantiles are known as deciles
- The 100-quantiles are known as percentiles

Interpretation:
- Q-Q Plot: Points should lie close to the 45° line for normality.

For example, the 10-quantiles divide the normal distribution into 10 parts, each containing 10% of the probability. The Q-Q plot or quantile-quantile plot is a scatter plot created by plotting two sets of quantiles against one another.

In [None]:
# We once again use the statsmodels library to assist us in producing our qqplot visualisation.
from statsmodels.graphics.gofplots import qqplot

In [None]:
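fig, ax = plt.subplots(figsize=(8,5))
# fit=True (a boolean, not the string 'True') estimates location and scale from the data;
# ax=ax draws on the figure created above instead of opening a second, empty one
qqplot(fitted.resid_pearson, line='45', fit=True, ax=ax)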
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.xlabel("Theoretical quantiles",fontsize=15)
plt.ylabel("Sample quantiles",fontsize=15)
plt.title("Q-Q plot of normalized residuals",fontsize=18)
plt.grid(True)
plt.show()

In [None]:
import matplotlib.pyplot as plt
import scipy.stats as stats

# Calculate residuals
residuals = Y - model.predict(X)

# Q-Q plot
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q Plot of Residuals")
plt.show()
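
The points on a Q-Q plot can also be built by hand, which makes clear what is being compared: the ordered, standardized residuals against normal quantiles at evenly spaced probabilities. In this sketch the correlation between the two sets of quantiles serves as a rough numeric summary of how straight the plot is; the plotting positions (i − 0.5)/n, the helper names and the simulated residuals are illustrative assumptions.

```python
from statistics import NormalDist
import numpy as np

def qq_points(resid):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    resid = np.asarray(resid, dtype=float)
    z = np.sort((resid - resid.mean()) / resid.std(ddof=1))   # standardized, ordered
    n = len(z)
    probs = (np.arange(1, n + 1) - 0.5) / n                   # plotting positions
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theo, z

def qq_correlation(resid):
    """Correlation between theoretical and sample quantiles (1.0 = perfectly straight)."""
    theo, samp = qq_points(resid)
    return float(np.corrcoef(theo, samp)[0, 1])

rng = np.random.default_rng(8)
resid_normal = rng.normal(0, 2, 500)
resid_skewed = rng.exponential(2, 500) - 2        # right-skewed, mean roughly zero

corr_normal = qq_correlation(resid_normal)
corr_skewed = qq_correlation(resid_skewed)
print(f"Normal residuals: Q-Q correlation = {corr_normal:.4f}")
print(f"Skewed residuals: Q-Q correlation = {corr_skewed:.4f}")
```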

(b) Shapiro-Wilk Test
- Statistical test for normality.

Interpretation

Null Hypothesis: Residuals follow a normal distribution.
- If p-value < 0.05, reject the null hypothesis, indicating non-normality.

In [None]:
from scipy.stats import shapiro

shapiro_test = shapiro(residuals)
print(f"Shapiro-Wilk Test Statistic: {shapiro_test.statistic}, p-value: {shapiro_test.pvalue}")

(c) Kolmogorov-Smirnov Test
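- Another test for normality. Because the mean and standard deviation are estimated from the residuals themselves, the standard KS p-value is only approximate (it tends to be too large); the Lilliefors variant of the test corrects for this.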

In [None]:
from scipy.stats import kstest

ks_test = kstest(residuals, 'norm', args=(residuals.mean(), residuals.std()))
print(f"KS Test Statistic: {ks_test.statistic}, p-value: {ks_test.pvalue}")

(d) Anderson-Darling Test
- Tests for how well data fits a specific distribution.

Compare the statistic to critical values. 
- If the statistic exceeds the critical value for a given significance level, residuals deviate from normality.

In [None]:
from scipy.stats import anderson

anderson_test = anderson(residuals, dist="norm")
print("Anderson-Darling Test Results:")
print(f"Statistic: {anderson_test.statistic}")
print("Critical Values:", anderson_test.critical_values)

**Step 2: Addressing Non-Normal Residuals/ Handling Normality of Errors**

(a) Transform the Response Variable

1. Log Transformation: Use if residuals are right-skewed.

In [None]:
import numpy as np

df["Log_HousePrice"] = np.log(df["HousePrice"])
model_log = sm.OLS(df["Log_HousePrice"], X).fit()
print(model_log.summary())

2. Applying Box-Cox Transformation
- Finds the best transformation parameter (𝜆).

In [None]:
from scipy.stats import boxcox

# Apply Box-Cox to the response variable
Y_boxcox, lambda_boxcox = boxcox(Y)
print(f"Optimal lambda for Box-Cox: {lambda_boxcox}")

# Fit the model again
model_boxcox = sm.OLS(Y_boxcox, X).fit()
print(model_boxcox.summary())


3. Square Root Transformation: Helps stabilize variance and normalize data.

In [None]:
df["Sqrt_HousePrice"] = np.sqrt(df["HousePrice"])
model_sqrt = sm.OLS(df["Sqrt_HousePrice"], X).fit()
print(model_sqrt.summary())

(b) Using Robust Regression

If normality cannot be achieved:
- Robust regression minimizes the influence of outliers and non-normal errors.

1. Huber Regression: Combines linear regression with robustness to outliers.

In [None]:
from sklearn.linear_model import HuberRegressor

huber = HuberRegressor()
huber.fit(X, Y)
print("Huber Coefficients:", huber.coef_)

2. Quantile Regression: Models conditional quantiles of the response; with q=0.5 it models the conditional median instead of the mean.

In [None]:
import statsmodels.api as sm

model_quantile = sm.QuantReg(Y, X).fit(q=0.5)  # Median regression
print(model_quantile.summary())

3. Robust linear model:

In [None]:
from statsmodels.robust.robust_linear_model import RLM

# Fit a robust linear model using the imported RLM class
model_robust = RLM(Y, X).fit()
print(model_robust.summary())

(c) Bootstrap for Non-Normal Residuals
- Bootstrapping creates confidence intervals without assuming normality.

In [None]:
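from sklearn.utils import resample
import numpy as np
import statsmodels.api as sm

# Case (pairs) bootstrap: resample rows of (Y, X) together and refit each time
bootstrap_samples = 1000
boot_means = []  # one coefficient vector per bootstrap refit

for _ in range(bootstrap_samples):
    Y_boot, X_boot = resample(Y, X)
    model_boot = sm.OLS(Y_boot, X_boot).fit()
    boot_means.append(model_boot.params)

boot_means = np.array(boot_means)
# The 2.5th and 97.5th percentiles give a 95% percentile interval per coefficient
print("Bootstrap Confidence Intervals:")
print(np.percentile(boot_means, [2.5, 97.5], axis=0))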


**Step 3: Evaluating Adjustments**

Evaluate and Compare Performance
- Residual plots before and after adjustments.
- Normality tests on new residuals.
- Performance metrics like $R^2$, AIC, and BIC.

In [None]:
print(f"Original Model AIC: {model.aic}")
print(f"Log-Transformed Model AIC: {model_log.aic}")
print(f"Box-Cox Model AIC: {model_boxcox.aic}")

Plot Residuals After Adjustments:

In [None]:
# Residuals from Box-Cox model
residuals_boxcox = Y_transformed - model_boxcox.predict(X)

plt.hist(residuals_boxcox, bins=20, edgecolor='k', alpha=0.7)
plt.title("Residual Histogram After Box-Cox")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()

Testing:
- Use visual methods like histograms and Q-Q plots.
- Statistical tests (Shapiro-Wilk, Anderson-Darling) provide formal evidence for or against normality.

Correction:
- Start with transformations like log or Box-Cox.
- Use robust regression if transformations fail or residuals deviate significantly.

Validation:
- Reassess residual plots and metrics post-adjustment.
- Ensure the model aligns with assumptions.

# Checking for Outliers in Residuals

Check for outliers amongst the residuals.

#### Plotting Cook's Distance

Cook's distance is a calculation which measures the effect of deleting an observation from the data. 
- Observations with large Cook's distances should be earmarked for closer examination in the analysis due to their disproportionate impact on the model.

**Observation**

Check values with much higher Cook's distances than the rest. 
- A rule of thumb for determining whether a Cook's distance is too large is whether it is greater than four times the mean Cook's distance.

In [None]:
from statsmodels.stats.outliers_influence import OLSInfluence as influence

In [None]:
inf=influence(fitted)

In [None]:
(c, p) = inf.cooks_distance
plt.figure(figsize=(8,5))
plt.title("Cook's distance plot for the residuals",fontsize=16)
# The use_line_collection argument has been removed from recent Matplotlib releases, so it is omitted
plt.stem(np.arange(len(c)), c, markerfmt=",")
plt.grid(True)
plt.show()

#### Calculate the mean Cook's Distance

Check which observations have a Cook's distance more than four times the mean.

Implications: such observations are highly influential in this dataset
- and warrant closer examination.

In [None]:
print('Mean Cook\'s distance: ', c.mean())
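
The four-times-the-mean rule can be applied programmatically. The sketch below computes Cook's distance from first principles on simulated data with one planted influential point, then lists the observations above the threshold; the data, the planted point and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 60
x = rng.uniform(0, 10, n)
y = 3 + 2 * x + rng.normal(0, 1, n)
x[0], y[0] = 20.0, 10.0            # planted point: far out in x and far off the line

X = np.column_stack([np.ones(n), x])
p = X.shape[1]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - p)

# Leverage h_ii is the diagonal of the hat matrix X (X'X)^-1 X'
h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)

# Cook's distance: D_i = e_i^2 * h_ii / (p * s^2 * (1 - h_ii)^2)
cooks_d = resid**2 * h / (p * s2 * (1 - h) ** 2)

threshold = 4 * cooks_d.mean()
flagged = np.flatnonzero(cooks_d > threshold)
print(f"Mean Cook's distance: {cooks_d.mean():.4f}")
print("Observations above 4x the mean:", flagged)
```

With the statsmodels output used above, the same filter would be `np.flatnonzero(c > 4 * c.mean())`, assuming `c` holds the array of Cook's distances.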

## 3. Logistic Regression

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary).

It is used to describe data and to explain the relationship between one dependent binary variable and one or more 
- nominal, 
- ordinal, 
- interval or 
- ratio-level independent variables.

Logistic regression is a statistical model used for binary classification tasks.
- The outcome variable is categorical with two possible values (e.g., 1/0, Yes/No, Positive/Negative).
- It is used to predict class probabilities in classification problems.

It predicts the probability of an event occurring, transforming the linear combination of predictors through a logistic function (sigmoid function) to ensure the predicted probabilities lie between 0 and 1.

Model Equation: 
$$ P(y=1) = \frac{1}{1 + e^{-(\beta_{0} + \beta_{1} x_{1} + \dots + \beta_{n} x_{n})}} $$

**What It Means:** 
- Logistic regression estimates the probability of a binary outcome (e.g., yes/no, success/failure) based on predictor variables. 
    - It uses a logistic function to map predictions to probabilities between 0 and 1.

- It is a statistical technique for investigating the relationship between a binary dependent variable (outcome) and one or more independent variables (predictors). 

- The goal of logistic regression is to find the best-fitting model to describe the relationship between the dependent variable and the independent variables and then use that model to predict the outcome variable.
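
As a small illustration of how the logistic function maps a linear combination of predictors to a probability, consider the sketch below. The coefficient values and the three observations are hypothetical and chosen only for the arithmetic.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients: intercept beta_0 and two slopes beta_1, beta_2
beta = np.array([-1.0, 0.8, -0.5])

# Three observations; the leading 1 multiplies the intercept
X = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 2.0, 1.0],
    [1.0, 5.0, 0.5],
])

linear_part = X @ beta          # beta_0 + beta_1 * x_1 + beta_2 * x_2
prob = sigmoid(linear_part)     # P(y = 1) for each observation
print("Linear predictor:", np.round(linear_part, 2))
print("Probability:     ", np.round(prob, 3))
```

A linear predictor of 0 corresponds to a probability of exactly 0.5; large negative values approach 0 and large positive values approach 1.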

**Lay Explanation:**
- Logistic regression is like a yes-or-no decision helper. It estimates the chances of an event happening (e.g., a customer buying a product) based on known factors.
- It tries to find the best-fitting S-shaped (sigmoid) curve for the data.

**Why use Logistic Regression rather than Linear Regression?**

Outlier Influence:
- In linear regression a single extreme point can shift the best-fit line towards itself, which moves the point at which the line crosses any classification threshold.

Predicted outcome out of range:
- In linear regression, the predicted values may fall outside the range 0 to 1, so they cannot be read as probabilities.

Response Variable:
- Linear regression is used when the dependent variable is continuous.
- Logistic regression is used when the dependent variable is binary.

Logistic regression is well suited to a problem such as predicting hospital readmission because:
- Binary Outcome: The target variable is binary: Readmitted (1) or Not Readmitted (0).
- Interpretability: It provides coefficients (log odds) that indicate how changes in predictors affect the likelihood of the event (readmission).
- Insights: It helps identify the significant factors influencing readmissions.

### Outcome Interpretation: 
- The model outputs probabilities that can be converted to binary outcomes. 
- Coefficients show how each predictor variable influences the likelihood of the outcome.
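
A short worked example may help with reading coefficients. Because the model is linear in the log-odds, a one-unit increase in a predictor multiplies the odds by exp(β). The coefficients below are hypothetical, and the 0.5 threshold is just the conventional default.

```python
import math

# Hypothetical fitted model: log-odds = beta_0 + beta_1 * x
beta_0, beta_1 = -2.0, 0.7

def predicted_probability(x):
    """P(y = 1 | x) under the hypothetical model above."""
    log_odds = beta_0 + beta_1 * x
    return 1.0 / (1.0 + math.exp(-log_odds))

def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)

# A one-unit increase in x multiplies the odds by exp(beta_1)
odds_ratio = math.exp(beta_1)

p_at_3 = predicted_probability(3.0)
p_at_4 = predicted_probability(4.0)
print(f"Odds ratio exp(beta_1):     {odds_ratio:.3f}")
print(f"Odds at x=4 / odds at x=3:  {odds(p_at_4) / odds(p_at_3):.3f}")

# Converting a probability into a class label with a 0.5 threshold
label_at_3 = int(p_at_3 >= 0.5)
print(f"P(y=1 | x=3) = {p_at_3:.3f} -> predicted class {label_at_3}")
```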

### Performance Measures:
- Accuracy: Proportion of correct predictions.
- AUC-ROC: Measures the model's ability to distinguish between classes; values closer to 1 indicate a better model.
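
Both measures can be computed by hand on a toy example. The sketch below uses the pairwise interpretation of AUC (the probability that a randomly chosen positive receives a higher score than a randomly chosen negative); the four labels and scores are made up for illustration.

```python
import numpy as np

def accuracy(y_true, prob, threshold=0.5):
    """Share of observations whose thresholded prediction matches the label."""
    return float(np.mean((prob >= threshold).astype(int) == y_true))

def auc_roc(y_true, prob):
    """Probability that a random positive is scored above a random negative.

    Ties count as one half. The pairwise form costs n_pos * n_neg comparisons,
    so it suits small examples only.
    """
    pos = prob[y_true == 1]
    neg = prob[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

y_true = np.array([0, 0, 1, 1])
prob = np.array([0.10, 0.40, 0.35, 0.80])

acc = accuracy(y_true, prob)
auc = auc_roc(y_true, prob)
print("Accuracy at threshold 0.5:", acc)
print("AUC-ROC:", auc)
```

Accuracy depends on the chosen threshold, whereas AUC summarises ranking quality across all thresholds.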

### Types of Logistic Regression

#### Binary Logistic Regression
Binary logistic regression is used to predict the probability of a binary outcome, such as 
- yes or no, 
- true or false, or 
- 0 or 1. 

For example, it could be used to:
- predict whether a customer will churn or not, 
- predict whether a patient has a disease or not, or 
- predict whether a loan will be repaid or not.

#### Multinomial Logistic Regression
Multinomial logistic regression is used to predict the probability of one of three or more possible outcomes that have no natural order, such as 
- the type of product a customer will buy, 
- the rating a customer will give a product (when ratings are treated as unordered categories), or 
- the political party a person will vote for.

#### Ordinal Logistic Regression
Ordinal logistic regression is used to predict the probability of an outcome whose categories fall into a predetermined order, such as 
- the level of customer satisfaction, 
- the severity of a disease, or 
- the stage of cancer.

#### Least-Squares Regression 
Least-squares regression is a foundational method in statistics and data science for modeling relationships between variables, particularly for continuous dependent variables. 
- It does so by finding the line (or hyperplane in higher dimensions) that minimizes the sum of the squared differences (residuals) between the observed and predicted values of the dependent variable.

Application:
- Continuous Outcomes: Least-squares regression is most commonly used for problems where the dependent variable is continuous, such as 
    - predicting house prices, 
    - stock prices, or 
    - blood pressure.
- Exploratory Analysis: Identifying potential relationships between variables.

##### Drawback of least-squares regression
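The main drawback of least-squares regression, when applied to the classification tasks that logistic regression is designed for, is that it assumes a linear relationship with an unbounded continuous response, which leads to problems when modeling binary or categorical outcomes.

Key Issues: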

1. Inappropriate Predictions
- Least-squares regression is designed for continuous outcomes and does not restrict predictions to the range [0, 1], which is required for probabilities in classification problems.
- For binary classification, it can result in predictions outside the valid probability range, such as negative values or values greater than 1, which are meaningless.

2. Violation of Assumptions
- The error terms (residuals) in least-squares regression are assumed to be normally distributed and homoscedastic (constant variance). 
    - However, in classification problems, these assumptions are violated because:
        - The dependent variable is not continuous but binary.
        - The variance of the binary response variable is a function of the mean (heteroscedasticity), not constant.

3. Inefficient Parameter Estimation (relationship between the predictors and the binary outcome)
- In classification tasks, the relationship between the predictors and the probability of the outcome is often non-linear (sigmoid-shaped in logistic regression).
- Linear least squares does not model this non-linear relationship correctly.
    - As a result, least squares is inefficient in estimating parameters and may lead to biased coefficients.

4. Poor Performance for Separation
- Least-squares regression does not inherently maximize the separation between the two classes in binary classification problems. 
- Logistic regression, on the other hand, maximizes the likelihood of the observed data, providing a more suitable objective for classification tasks.

5. Susceptibility to Outliers
- Least-squares regression is sensitive to outliers, as it minimizes the squared residuals. 
- In a classification context, outliers in the feature space can have a disproportionately large influence on the model, leading to poor generalization.

##### Why Logistic Regression Instead of Least Squares?
Logistic regression overcomes these drawbacks by:
- Modeling the probability of the binary outcome using the logit function (log-odds), ensuring probabilities stay within [0, 1].
- Using maximum likelihood estimation (MLE) to fit the model, which aligns with the probabilistic nature of classification problems.
- Making no assumptions about normally distributed errors, as it focuses on the Bernoulli distribution of binary outcomes.
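
The first drawback (predictions outside [0, 1]) can be seen on a tiny example. The sketch below uses a made-up dataset with one predictor, fits a straight line by least squares with `np.polyfit`, and compares it with a logistic curve. The logistic coefficients (-9 and 2) are hand-picked for illustration, not fitted.

```python
import numpy as np

# Hypothetical toy data: one predictor, binary outcome that switches at x = 5
x = np.arange(10, dtype=float)
y = (x >= 5).astype(float)

# Ordinary least-squares fit of y on x (straight line)
slope, intercept = np.polyfit(x, y, deg=1)
ols_pred = intercept + slope * x

# Logistic curve with hand-picked (not fitted) coefficients, for comparison only
z = -9.0 + 2.0 * x
logistic_pred = 1.0 / (1.0 + np.exp(-z))

print("OLS predictions:     ", np.round(ols_pred, 2))
print("Logistic predictions:", np.round(logistic_pred, 2))
```

The least-squares line goes below 0 at the low end and above 1 at the high end, so its outputs cannot be read as probabilities, while the logistic curve stays strictly between 0 and 1.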


##### Differences Between Linear and Logistic Regression
The core difference lies in their target predictions.
- Linear regression excels at predicting continuous values along a spectrum. 
    - The resulting output is a specific amount, a value on a continuous scale.
- Linear regression answers “how much” questions, providing a specific value on a continuous scale.

- Logistic regression deals with categories. 
    - It doesn’t predict a specific value but rather the likelihood of something belonging to a particular class.
    - For example, in spam filtering the output would be a probability between 0 (not likely spam) and 1 (very likely spam). 
    - This probability is then used to assign an email to a definitive category (spam or not spam) based on a chosen threshold.
- Logistic regression tackles “yes or no” scenarios, giving the probability of something belonging to a certain category.

### Problem Statement
Objective:
- As a medical institute, we want to estimate the likelihood of a patient being readmitted within 30 days of discharge, based on:
    - demographics, 
    - medical history, 
    - length of stay (LOS), and 
    - clinical metrics such as blood pressure, blood glucose levels, and medication adherence.

**Key Assumptions of Logistic Regression**

Data Specific
- Binary Outcome: The dependent variable is binary.
    - Logistic regression is designed for binary dependent variables. 
    - If your outcome has more than two categories, you might need a multinomial logistic regression or other classification techniques.
- Independence of Observations: Observations are independent of each other.
    -  This means no repeated measurements or clustering within the data.

Relationship Between Variables
- Linearity of Log-Odds: There is a linear relationship between the log-odds of the outcome and the independent variables.
    - The probability of the outcome is linked to the predictors only through the log-odds.
    - The outcome itself does not have a linear relationship with the independent variables.
- No Multicollinearity: Independent variables are not highly correlated.
    - Multicollinearity can cause instability in the model and make it difficult to interpret the coefficients.

Other
- Large Sample Size: Logistic regression performs well with larger datasets.
    - A sufficiently large sample is needed for reliable parameter estimates.
- Absence of Outliers: Outliers can significantly influence the model. 
    - It’s important to check for and address any outliers that might distort the results.
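
One common way to check the multicollinearity assumption is the variance inflation factor (VIF): regress each predictor on the others and compute $1/(1-R^2)$. The sketch below does this with plain NumPy on simulated data. The variable names and the simulated relationship between `age` and `los` are assumptions made only for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])       # add an intercept column
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ coef
        r2 = 1.0 - resid.var() / target.var()           # R^2 of predictor j on the rest
        out[j] = 1.0 / (1.0 - r2)
    return out

# Simulated (hypothetical) predictors: los is almost a linear function of age
rng = np.random.default_rng(0)
age = rng.normal(60, 10, size=200)
los = 0.2 * age + rng.normal(0, 0.2, size=200)
glucose = rng.normal(150, 30, size=200)

vifs = vif(np.column_stack([age, los, glucose]))
for name, v in zip(["age", "los", "glucose"], vifs):
    print(f"{name}: VIF = {v:.1f}")
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity. Here `age` and `los` are far above that range, while `glucose` stays close to 1.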

**Step 1: Define the Problem**
- Target Variable: Readmission within 30 days (1 = Yes, 0 = No).
- Predictors:
    - Patient Demographics: Age, gender, insurance status.
    - Clinical Metrics: Blood glucose levels, blood pressure, medication adherence.
    - Hospital Metrics: Length of Stay (LOS), number of previous visits.

**Step 2: Collect and Prepare Data**
- Gather historical patient data and ensure it's clean and consistent.
- Check for Missing Data:
    - Impute missing values for predictors like glucose levels using median or mean.
- Standardize Continuous Variables:
    - Standardize LOS, glucose levels, and blood pressure for consistency.
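
As a minimal sketch of the two preparation steps above, the snippet below imputes missing glucose readings with the median and then standardizes the column to z-scores. The values are invented for illustration. In a real project the median, mean, and standard deviation would normally be computed on the training data only.

```python
import numpy as np

# Hypothetical glucose readings with two missing values (NaN)
glucose = np.array([150.0, np.nan, 180.0, 140.0, np.nan, 220.0])

# Median imputation: replace NaN with the median of the observed values
median = np.nanmedian(glucose)
glucose_imputed = np.where(np.isnan(glucose), median, glucose)

# Standardization (z-score): subtract the mean, divide by the standard deviation
glucose_std = (glucose_imputed - glucose_imputed.mean()) / glucose_imputed.std()

print("Median used for imputation:", median)
print("Imputed:", glucose_imputed)
print("Standardized:", np.round(glucose_std, 2))
```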

In [None]:
import pandas as pd
import statsmodels.api as sm

# Example dataset
data = pd.DataFrame({
    'age': [45, 60, 50, 40, 70],
    'los': [3, 7, 4, 2, 10],
    'glucose': [150, 200, 180, 140, 220],
    'med_adherence': [0.8, 0.6, 0.75, 0.9, 0.5],
    'readmitted': [1, 1, 0, 0, 1]
})

# Features and target
X = data[['age', 'los', 'glucose', 'med_adherence']]
y = data['readmitted']

# Add constant for intercept
X = sm.add_constant(X)

**Step 3: Exploratory Data Analysis**
- Univariate Analysis: Examine distributions of continuous variables.
- Bivariate Analysis: Analyze relationships between predictors and the target variable.
- Correlation Matrix: Identify multicollinearity among predictors.

**Step 4: Perform Logistic Regression**

How logistic regression squeezes the output of linear regression between 0 and 1.

Best Fit Equation in Linear Regression

$ y = 𝛽_{0}+𝛽_{1}𝑥_{1}$

Now we want to model a probability (P) instead of y.

**Issue**: 
A linear function can produce values above 1 or below 0, but a probability must lie in the range [0, 1].

Odds and log-odds are central to understanding the relationship between predictors and the probability of an event occurring.

**Overcoming the restriction $0 \le P \le 1$**

We start by taking the “odds” of P:

Odds: The odds represent the ratio of the probability of an event occurring (P) to the probability of it not occurring (1−P).

$$ \text{Odds} =  \frac{P}{1-P}$$

Log-Odds (Logit): The natural logarithm of the odds.

$$ \text{Log-Odds} =  \log\left(\frac{P}{1-P}\right)$$

In logistic regression, we look for a transformation of P that can be modeled as a linear function of the predictors. Modeling P directly fails because a straight line is unbounded:

$$ P = 𝛽_{0}+𝛽_{1}𝑥_{1}$$

Modeling the odds instead removes the upper bound:

$$ \frac{P}{1-P} = 𝛽_{0}+𝛽_{1}𝑥_{1}$$

Odds are never negative, so their range is $[0,+∞)$.
- Odds are the ratio of the probability of success to the probability of failure.

Why ‘odds’?
- Odds are a simple transformation of P that removes the upper limit of 1.

Problem: the range of the odds is still restricted, because the right-hand side $𝛽_{0}+𝛽_{1}𝑥_{1}$ can be negative while the odds cannot.
- A variable with a restricted range is difficult to model with a linear function.
- Range restriction also tends to weaken the observed correlation with the predictors.

Control:
- To remove this restriction, we take the log of the odds, which has a range of $(-∞,+∞)$.

$ \log(\frac{P}{1-P}) = 𝛽_{0}+𝛽_{1}𝑥_{1}$

Now we want a function of P, because we want to predict a probability, not the log of odds. To do so, we 
- exponentiate both sides and then solve for P.

$ \exp\left[\log\left(\frac{P}{1-P}\right)\right] = \exp(𝛽_{0}+𝛽_{1}𝑥_{1})$

$ \frac{P}{1-P} = e^{𝛽_{0}+𝛽_{1}𝑥_{1}} $

$ P = e^{𝛽_{0}+𝛽_{1}𝑥_{1}}  - P\,e^{𝛽_{0}+𝛽_{1}𝑥_{1}}$

$ P\left(1 + e^{𝛽_{0}+𝛽_{1}𝑥_{1}}\right) = e^{𝛽_{0}+𝛽_{1}𝑥_{1}}$

$ P = \frac{e^{𝛽_{0}+𝛽_{1}𝑥_{1}}}{1 + e^{𝛽_{0}+𝛽_{1}𝑥_{1}}} = \frac{1}{1 + e^{-(𝛽_{0}+𝛽_{1}𝑥_{1})}}$

This is the sigmoid function.

Model Equation: 
$ 𝑃(𝑦=1)= \frac{1}{1+𝑒^{−(𝛽_{0}+𝛽_{1}𝑥_{1}+…+𝛽_{𝑛}𝑥_{𝑛})}}$

It squeezes a straight line into an S-curve.

In [None]:
import numpy as np

# Function to calculate the sigmoid function
def sigmoid(z):
    """
    The sigmoid function maps log-odds to probabilities between 0 and 1.
    """
    return 1 / (1 + np.exp(-z))

# Function to calculate odds and log-odds
def logistic_regression_predict(X, coefficients):
    """
    Predict probabilities, odds, and log-odds using logistic regression.
    
    Parameters:
    - X: Feature matrix (numpy array of shape [n_samples, n_features])
    - coefficients: Coefficients including intercept (numpy array of shape [n_features + 1])
    
    Returns:
    - probabilities: Predicted probabilities (numpy array of shape [n_samples])
    - odds: Odds of event occurring (numpy array of shape [n_samples])
    - log_odds: Log-Odds (numpy array of shape [n_samples])
    """
    # Add intercept to the feature matrix
    X = np.hstack((np.ones((X.shape[0], 1)), X))  # Add a column of ones for the intercept
    
    # Calculate log-odds (z = X * coefficients)
    log_odds = np.dot(X, coefficients)
    
    # Calculate probabilities using the sigmoid function
    probabilities = sigmoid(log_odds)
    
    # Calculate odds
    #  Derived from probabilities using the formula
    odds = probabilities / (1 - probabilities)
    
    return probabilities, odds, log_odds

# Example usage
# Example dataset: X contains two features, and coefficients include intercept and weights
X = np.array([[2, 3], [1, 0], [4, 5]])  # Feature matrix
coefficients = np.array([-3, 0.5, 1])  # Coefficients (intercept + weights for features)

# Predict probabilities, odds, and log-odds
probabilities, odds, log_odds = logistic_regression_predict(X, coefficients)

# Predicted Probabilities: Likelihood of the event occurring.
# Odds: Ratio of the probability of success to failure.
# Log-Odds: Linear transformation of the predictors.

# Print results
print("Predicted Probabilities:", probabilities)
print("Odds:", odds)
print("Log-Odds:", log_odds)

**log-odds linear function**

The log-odds linear function is a core concept in logistic regression and represents the relationship between the independent variables (predictors) and the log-odds of the dependent variable (outcome).

$$ \log\left(\frac{P}{1-P}\right) = 𝛽_{0}+𝛽_{1}𝑥_{1}+𝛽_{2}𝑥_{2}+...+𝛽_{p}𝑥_{p}$$

Where:
- $𝛽_{0}$: Intercept (bias term).
- $𝛽_{1}, 𝛽_{2},..., 𝛽_{p}$: Coefficients of the predictors $x_{1}, x_{2},..., x_{p}$
- $x_{1}, x_{2},..., x_{p}$: Values of the independent variables.
- $P$: Predicted probability of the event occurring.

Steps to Calculate Log-Odds
1. Start with the linear combination: Compute a weighted sum of the predictors and the intercept:

$$ z = 𝛽_{0}+𝛽_{1}𝑥_{1}+𝛽_{2}𝑥_{2}+...+𝛽_{p}𝑥_{p}$$

2. Interpret z as the log-odds: The value z is the log-odds, which can be converted to:
- Odds using: 
$$ \text{odds} = e^z $$
- Probability using the sigmoid function:
$$ P= \frac{1}{1+e^{-z}}$$

In [None]:
import numpy as np

# Function to calculate log-odds
def calculate_log_odds(intercept, coefficients, predictors):
    """
    Calculate log-odds for logistic regression.

    Parameters:
    - intercept: Intercept term (beta_0)
    - coefficients: Coefficients for the predictors (list or array)
    - predictors: Values of the predictors (list or array)

    Returns:
    - log_odds: Computed log-odds
    """
    # Ensure inputs are numpy arrays
    coefficients = np.array(coefficients)
    predictors = np.array(predictors)
    
    # Compute log-odds
    log_odds = intercept + np.dot(coefficients, predictors)
    return log_odds

# Example inputs
intercept = -2
coefficients = [0.8, -1.2]  # Beta coefficients
predictors = [3, 5]         # Predictor values (x_1, x_2)

# Calculate log-odds
log_odds = calculate_log_odds(intercept, coefficients, predictors)
print("Log-Odds:", log_odds)


**Calculate class probabilities in logistic regression**

The logistic (sigmoid) function is used to transform the log-odds into probabilities. 
- The logistic function ensures the probabilities range between 0 and 1, making it suitable for classification problems.

Logistic Function for Probability

$$ P= \frac{1}{1+e^{-z}}$$

Where:
- P: Probability of the positive class (class 1).
- z: Log-odds, calculated as:

$$ z = 𝛽_{0}+𝛽_{1}𝑥_{1}+𝛽_{2}𝑥_{2}+...+𝛽_{p}𝑥_{p}$$

- z is the weighted sum of the predictors and the intercept.

The logistic function outputs:
- P: Probability of the positive class (class 1).
- 1−P: Probability of the negative class (class 0).

Steps to Calculate Class Probability
1. Calculate Log-Odds (z): Compute the linear combination of the intercept ($𝛽_0$) and the predictor variables.
2. Apply the Logistic Function: 
Use the formula: 
$$ P= \frac{1}{1+e^{-z}}$$

3. Interpret the Result:
- If P≥0.5, classify the observation as the positive class (class 1).
- If P<0.5, classify the observation as the negative class (class 0).

In [None]:
import numpy as np

# Sigmoid (logistic) function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Function to calculate probability
def calculate_probability(intercept, coefficients, predictors):
    """
    Calculate class probability using the logistic function.

    Parameters:
    - intercept: Intercept term (beta_0)
    - coefficients: Coefficients for predictors (list or array)
    - predictors: Values of predictors (list or array)

    Returns:
    - probability: Probability of the positive class (class 1)
    """
    # Ensure inputs are numpy arrays
    coefficients = np.array(coefficients)
    predictors = np.array(predictors)
    
    # Calculate log-odds
    log_odds = intercept + np.dot(coefficients, predictors)
    
    # Apply sigmoid function to get probability
    probability = sigmoid(log_odds)
    return probability

# Example inputs
intercept = -2
coefficients = [0.8, -1.2]  # Beta coefficients
predictors = [3, 5]         # Predictor values (x_1, x_2)

# Calculate class probability
probability = calculate_probability(intercept, coefficients, predictors)
print("Class Probability (P for class 1):", probability)


### Decision boundary in logistic regression

The decision boundary in logistic regression is the threshold at which the model predicts one class over the other. It represents the dividing line (or surface in higher dimensions) between the predicted classes in the feature space.

Key Points about Decision Boundary in Logistic Regression
1. Sigmoid Function and Threshold:
- Logistic regression uses the sigmoid function to output probabilities between 0 and 1.
- A commonly used threshold is 0.5
    - If P≥0.5, classify as class 1 (positive class).
    - If P<0.5, classify as class 0 (negative class).

2. Log-Odds and Decision Boundary:
- The decision boundary corresponds to where the log-odds (z) equals zero.
- At z=0:

$$ P= \frac{1}{1+e^{-z}}$$
$$ P= \frac{1}{1+e^{-0}}$$
$$ P= \frac{1}{2}$$
$$ P= 0.5$$

- Thus, the decision boundary is the set of points where z=0, or equivalently:

$$ z = 𝛽_{0}+𝛽_{1}𝑥_{1}+𝛽_{2}𝑥_{2}+...+𝛽_{p}𝑥_{p} = 0$$

3. Geometric Interpretation:
- In 2D (two predictors): The decision boundary is a line.
- In 3D (three predictors): The decision boundary is a plane.
- In higher dimensions: The decision boundary is a hyperplane.

4. Linear Nature of Decision Boundary:
- Logistic regression assumes a linear relationship between the predictors and the log-odds.
- The decision boundary is linear unless the model is extended with non-linear transformations of the predictors (e.g., polynomial features).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Coefficients
beta_0 = -2      # Intercept
beta_1 = 0.8     # Coefficient for x1
beta_2 = -1.2    # Coefficient for x2

# Generate a range of x1 values
x1 = np.linspace(-10, 10, 100)

# Calculate x2 for the decision boundary
x2 = (-beta_0 - beta_1 * x1) / beta_2

# Plot the decision boundary
plt.figure(figsize=(8, 6))
plt.plot(x1, x2, label="Decision Boundary", color="red")

# Add some random points for class 0 and class 1, centered on opposite sides
# of the boundary (z < 0 around class 0, z > 0 around class 1)
np.random.seed(42)
class_0 = np.random.multivariate_normal([-3, 3], [[2, 1], [1, 2]], size=50)
class_1 = np.random.multivariate_normal([3, -3], [[2, 1], [1, 2]], size=50)

plt.scatter(class_0[:, 0], class_0[:, 1], label="Class 0", color="blue", alpha=0.7)
plt.scatter(class_1[:, 0], class_1[:, 1], label="Class 1", color="green", alpha=0.7)

# Formatting
plt.axhline(0, color="black", linewidth=0.5, linestyle="--")
plt.axvline(0, color="black", linewidth=0.5, linestyle="--")
plt.title("Decision Boundary of Logistic Regression")
plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.grid()
plt.show()


### Key properties of the logistic regression equation

Explain the logistic regression model

Sigmoid Function:
- The model uses a special “S”-shaped curve to predict probabilities. It ensures that the predicted probabilities stay between 0 and 1.

Straightforward Relationship:
- The relationship between our inputs and the outcome is like drawing a straight line, except that the line is bent into a curve: the log-odds are linear in the inputs, while the probability follows the S-curve.

Coefficients / parameters:
- Numbers that tell us how much each input affects the outcome in the logistic regression model.
- Each coefficient tells us how much the log-odds of the outcome change for every one-unit increase in the predictor variable, holding the other predictors fixed.

Best Guess: 
- Figure out the best coefficients for the logistic regression model by looking at the data we have and tweaking them until our predictions match the real outcomes as closely as possible.

Basic Assumptions:
- We assume that our observations are independent, meaning one doesn’t affect the other. 
- We assume that there’s not too much overlap between our predictors (like age and height). 
- We assume the relationship between our predictors and the log-odds of the outcome is a straight line.

Probabilities, Not Certainties:
- Logistic regression gives us probabilities.
- We then decide on a cutoff point to make our final decision.

Checking Our Work:
- We check that our predictions are good using metrics like 
    - accuracy, 
    - precision, 
    - recall,
    - ROC curve.
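
The metrics listed above can be computed by hand from the four confusion-matrix counts. The sketch below uses made-up labels and predicted probabilities with a cutoff of 0.5. The variable names are illustrative only.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3])

# Apply the cutoff to turn probabilities into class labels
threshold = 0.5
y_pred = (y_prob >= threshold).astype(int)

# Confusion-matrix counts
tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"Accuracy={accuracy:.2f}, Precision={precision:.2f}, Recall={recall:.2f}")
```

Changing `threshold` trades precision against recall, which is the trade-off the ROC curve summarizes across all cutoffs.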

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve

data = pd.read_csv('data.csv') # read data from csv file
X = data[['Independent_Var_1', 'Independent_Var_2', 'Independent_Var_3']] # select independent variables
Y = data['Dependent_Var'] # select dependent variable

# Add a constant to the independent variable set
X = sm.add_constant(X)

# Fit the logistic regression model
model = sm.Logit(Y, X).fit()

# Print model summary
print(model.summary())

In [None]:
# Assumes X_train, X_test, and y_train already exist (e.g. from train_test_split)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load the data
data = pd.read_csv('data.csv')

# Split the data into training and testing sets
train = data[:800]
test = data[800:]

# Define the independent variables
X_train = train[['age', 'gender', 'income']]
X_test = test[['age', 'gender', 'income']]

# Define the dependent variable
y_train = train['buy_product']
y_test = test['buy_product']

# Fit the logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predict the outcomes for the test data
y_pred = logreg.predict(X_test)

# Evaluate the performance of the model
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")

**Step 5: Interpret Coefficients and Evaluate the Model**

- Log Odds: Each coefficient represents the change in log odds of readmission for a unit increase in the predictor.
- Odds Ratios: Use np.exp(model.params) to convert coefficients to odds ratios.

1. Accuracy
2. Confusion Matrix
3. ROC Curve and AUC
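
To make the log-odds and odds-ratio interpretation concrete, the sketch below exponentiates a few coefficients. The coefficient values are hypothetical, chosen for illustration, and not estimated from any data.

```python
import numpy as np

# Hypothetical coefficients on the log-odds scale (not fitted to real data)
coefs = {"los": np.log(2.0), "med_adherence": -1.5, "age": 0.0}

# Odds ratio = exp(coefficient)
odds_ratios = {name: float(np.exp(b)) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    change = (ratio - 1.0) * 100.0   # percentage change in the odds per unit increase
    print(f"{name}: odds ratio = {ratio:.3f} ({change:+.1f}% odds per unit)")
```

An odds ratio above 1 means the odds of readmission rise with the predictor, below 1 means they fall, and exactly 1 means no association.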

**The Role of Deviance in a Logistic Regression Model**
- In logistic regression, deviance measures how well the model fits the data.
    - It plays the same role as the Residual Sum of Squares (RSS) in linear regression, but is used for models where the dependent variable is categorical (e.g., binary classification).

What is Deviance?
- Deviance is derived from the log-likelihood function and quantifies how much error remains after fitting the model. It is given by:
$$\text{Deviance} = -2 \times \log(\text{Likelihood})$$
- A lower deviance means the model fits the data better.
- A higher deviance indicates poor model fit.

In logistic regression, we use two types of deviance:
- Null Deviance (Baseline Model):
    - The deviance of a model with no predictors (only an intercept).
    - It assumes that all observations are predicted using the mean response.

- Residual Deviance (Full Model):
    - The deviance of the fitted model with predictors.
    - Lower residual deviance means the model explains more of the variability in the data.

Deviance Reduction:
$$\text{Deviance Reduction} = \text{Null Deviance} - \text{Residual Deviance}$$
- The bigger this difference, the better the model explains the data.
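
The deviance formulas above can be sketched directly in NumPy. The labels and the "fitted" probabilities below are invented for illustration. The null model predicts the overall event rate for everyone, which is what an intercept-only logistic regression does.

```python
import numpy as np

def deviance(y, p):
    """Deviance = -2 * log-likelihood for binary labels y and probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return -2.0 * log_lik

# Hypothetical labels and hypothetical fitted probabilities
y = np.array([1, 1, 0, 0, 1, 0, 1, 0])
p_fit = np.array([0.9, 0.7, 0.2, 0.3, 0.8, 0.1, 0.6, 0.4])

# Null model: every observation gets the base rate (mean of y)
p_null = np.full(y.shape, y.mean())

null_dev = deviance(y, p_null)
resid_dev = deviance(y, p_fit)

print(f"Null deviance:      {null_dev:.4f}")
print(f"Residual deviance:  {resid_dev:.4f}")
print(f"Deviance reduction: {null_dev - resid_dev:.4f}")
```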


Deviance as a Goodness-of-Fit Measure
- In logistic regression, deviance plays a role similar to RSS in linear regression:

| Regression Type | Error Metric | Purpose |
|---|---|---|
| Linear Regression | Residual Sum of Squares (RSS) | Measures total squared errors |
| Logistic Regression | Deviance | Measures log-likelihood (fit quality) |

- A large difference between null and residual deviance suggests a good model.
- A small difference means the predictors add little value to the model.

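Using Deviance for Model Comparison

Likelihood Ratio Test (LRT) Using Deviance

To compare two nested models (one with fewer predictors, one with more), we use:
$$G^2 = \text{Deviance}_{\text{simpler model}} - \text{Deviance}_{\text{full model}}$$
- If $G^2$ is large, the extra predictors significantly improve the model.
- If $G^2$ is small, the added predictors may be unnecessary.
- Under the null hypothesis that the extra predictors have no effect, $G^2$ approximately follows a chi-square distribution with degrees of freedom equal to the number of added predictors, so a Chi-square test can determine statistical significance.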

✅ Key Insight: Deviance helps evaluate whether additional predictors significantly improve the logistic regression model.

- Null Deviance: Deviance of a model with only the intercept (baseline model).
- Residual Deviance: Deviance of the fitted model (with predictors).
- Lower Deviance: Means a better fit to the data.
- Deviance Difference: Helps measure how much the predictors improve the model.
- Likelihood Ratio Test: Compares models to determine if added predictors are useful.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load Titanic dataset
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(url)

# Select relevant features
df = df[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']]

# Handle missing values
df.dropna(inplace=True)

# Define predictors (X) and target (y)
X = df[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']]
y = df['Survived']

# Add intercept term (required for statsmodels logistic regression)
X = sm.add_constant(X)

# Step 2: Fit Logistic Regression Model & Compute Deviance
# compare null deviance (baseline model) and residual deviance (fitted model).
# Fit logistic regression model
model = sm.Logit(y, X).fit()

# Print model summary (reports Log-Likelihood and LL-Null, from which the deviances follow)
print(model.summary())

# Extract null deviance & residual deviance
null_deviance = model.llnull * -2  # Deviance of model with only intercept
residual_deviance = model.llf * -2  # Deviance of fitted model

# Compute deviance reduction
deviance_reduction = null_deviance - residual_deviance

# Print results
print(f"Null Deviance: {null_deviance:.4f}")
print(f"Residual Deviance: {residual_deviance:.4f}")
print(f"Deviance Reduction: {deviance_reduction:.4f}")


Step 3: Interpret Results
- Null Deviance → Deviance of the model with only the intercept.
- Residual Deviance → Deviance after adding predictors.
- Deviance Reduction → Improvement in model fit.
    - Higher reduction → Predictors add significant value.
    - Small reduction → Predictors may not improve the model.

In [None]:
# Step 4: Likelihood Ratio Test for Model Comparison
# Compare the full model with a reduced model (fewer predictors).
# Reduced model (fewer predictors)
X_reduced = df[['Pclass', 'Age']]  # Use only Pclass & Age as predictors
X_reduced = sm.add_constant(X_reduced)

# Fit reduced model
reduced_model = sm.Logit(y, X_reduced).fit()

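# Compute likelihood ratio test statistic: G2 = -2 * (LL_reduced - LL_full)
from scipy.stats import chi2

G2 = (reduced_model.llf - model.llf) * -2
# Degrees of freedom = number of predictors dropped from the full model
dof = X.shape[1] - X_reduced.shape[1]
# p-value = upper-tail probability of the chi-square distribution
p_value = chi2.sf(G2, df=dof)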

# Print results
print(f"Likelihood Ratio Test Statistic (G2): {G2:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation:
# If G2 is large & p-value < 0.05, the additional predictors significantly improve the model.
# If G2 is small & p-value > 0.05, the additional predictors may not be useful.


**Step 6: Optimisation**

### Cost Function in Logistic Regression

Linear regression uses the mean squared error (MSE), the average of the squared differences between y_predicted and y_actual.
- This is derived from the **maximum likelihood estimator** under normally distributed errors.

In logistic regression, the prediction $\hat{Y}_i$ is a non-linear function of the parameters ($ \hat{Y} = \frac{1}{1+ e^{-z}}$).
- If we plug this into the MSE, the resulting cost function is non-convex and can have several local minima.

Problem: a non-convex cost function has local minima.
- Gradient descent may end up in one of them, miss the global minimum, and leave us with a larger error.

Solution: derive a different cost function for logistic regression.
- **Log loss**, which is also derived from the **maximum likelihood estimation method**.

$ \text{Log Loss} = \frac{1}{N} \sum^{N}_{i = 1} - \left( y_i \log(\hat{Y}_i) + (1 - y_i) \log (1 - \hat{Y}_i) \right)$
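
A minimal NumPy sketch of the log loss formula is shown below. The clipping constant `eps` is an implementation detail added here to avoid `log(0)`, and the example probabilities are made up.

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean log loss for binary labels and predicted probabilities of class 1."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithm stays finite
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

y = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.9, 0.1]
uncertain = [0.5, 0.5, 0.5, 0.5]
confident_wrong = [0.1, 0.9, 0.1, 0.9]

print("Confident and right:", round(log_loss(y, confident_right), 4))
print("Uncertain:          ", round(log_loss(y, uncertain), 4))
print("Confident and wrong:", round(log_loss(y, confident_wrong), 4))
```

The loss grows as the predicted probability of the true class shrinks, so confident wrong predictions are penalized most heavily.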

#### Maximum likelihood estimator

Primary Objective:
- The goal is to identify parameter values that maximize the likelihood function.
- The likelihood represents the joint probability of our sample observations, viewed as a function of the parameters.
- It is computed by multiplying the conditional probabilities of observing each example given the distribution parameters.
- This process aims to discover parameter values such that, when plugged into the model for P(x), they produce a value close to one for individuals who experienced the outcome and close to zero for those who did not.

Start by defining our likelihood function. 
- We know that the labels are binary:
    - we have two outcomes, success and failure. 
- This means we can interpret each label as a Bernoulli random variable.

A **random experiment** whose outcomes are of two types, success S and failure F, occurring with probabilities p and q = 1 − p respectively, is called a Bernoulli trial. If for this experiment a random variable X is defined such that it takes value 1 when S occurs and 0 if F occurs, then X follows a Bernoulli distribution.

#### Math behind this log loss function

$ Y \sim \text{Ber}(P)$

where P is given by our sigmoid function. For a single observation:

$ P[Y=y_i \mid X=x^i] = \sigma ( \theta^{T} x^i)^{y_i} (1 - \sigma(\theta^{T} x^i))^{1-y_i} $

where $\sigma(\theta^{T} x^i)$ is the sigmoid function. For n independent observations, the likelihood is the product:

$ L(\theta) = \prod^{n}_{i=1} \sigma ( \theta^{T} x^i)^{y_i} (1 - \sigma(\theta^{T} x^i))^{1-y_i} $

We need a value for theta which will maximize this likelihood function. 

To make our calculations easier
- we take the log of both sides, which turns the product into a sum. 

The function we get is also called the 
- log-likelihood function or 
- sum of the log conditional probabilities

$ \log(L(\theta)) = \sum^{n}_{i=1} \left[ y_i \log \sigma ( \theta^{T} x^i) + (1-y_i) \log(1 - \sigma(\theta^{T} x^i)) \right] $

In ML, it is conventional to minimize a loss (error) function via gradient descent, rather than maximize an objective function via gradient ascent. 
- Maximizing the function above would require gradient ascent. To avoid this, we take the negative of the log-likelihood so that we can use gradient descent.

$ \arg\max_{\theta} \log L(\theta) = \arg\min_{\theta} \left[-\log L(\theta)\right] $

The negative of this function is our cost function, the negative log-likelihood (NLL), and we look for the parameters that give it its minimum value.

$ - \log(L(\theta)) =  -\sum^{n}_{i=1} \left[ y_i \log \sigma ( \theta^{T} x^i) + (1-y_i) \log(1 - \sigma(\theta^{T} x^i)) \right] $

where 
- $y_i$ represents the actual class,
- $\sigma(\theta^{T} x^i)$ is the predicted probability of class 1, and
- $1 - \sigma(\theta^{T} x^i)$ is the predicted probability of class 0.

Graph of the cost function when y=1 and y=0:
- The cost is convex, with a single minimum, so it is easy to optimize with gradient descent.
    - The red line represents class 1 (y=1): the right term of the cost function vanishes. If the predicted probability is close to 1, the loss is small, and as the probability approaches 0, the loss goes to infinity.
    - The black line represents class 0 (y=0): the left term of the cost function vanishes. If the predicted probability is close to 0, the loss is small, and as the probability approaches 1, the loss goes to infinity.

$ \text{Cost}(h_{\Theta}(x),y) = \begin{cases} - \log(h_{\Theta}(x)) & \text{if } y = 1\\ - \log(1 - h_{\Theta}(x)) & \text{if } y = 0 \end{cases}$

This cost function is also called **log loss**.

Minimizing it ensures that the
- probability of the correct answer is maximized, 
- probability of the incorrect answer is minimized. 
    - The lower the value of this cost function, the better the predicted probabilities match the labels, which generally means higher accuracy.

### Gradient Descent Optimization

How to use gradient descent to compute the minimum cost.

- Gradient descent is an iterative method that finds the minimum of a function by computing the slope at the current point and then moving in the opposite direction.
- It updates the weights step by step so that they move toward the minimum point.
    - It aims to find the optimal weights, which minimize the loss function of our model.

At first 
- gradient descent starts from a random value of our parameters. 
- We then need a rule that tells us whether, at the next iteration, we should move left or right to reach the minimum point.
    - The gradient descent algorithm finds the slope of the loss function at that particular point.

In the next iteration, 
- it moves in the opposite direction of the slope, toward the minimum.

Since the cost function is convex, we don’t need to worry about local minima. 
- A convex curve has no local minima other than the global minimum.

Gradient descent algorithm

$ \theta_{j}^{new} = \theta_{j}^{old} - \alpha \frac{\partial J(\theta)}{\partial \theta_j} $

where alpha is known as the learning rate. 
- It determines the step size at each iteration while moving towards the minimum point. 
    - A relatively small value of “alpha” is usually preferred: if the learning rate is too large, we may overshoot the minimum point and keep oscillating across the convex curve, while a value that is too small makes convergence slow.
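
The update rule can be sketched for logistic regression in a few lines of NumPy. This is an illustrative implementation on simulated data, not a production optimizer. It uses the standard gradient of the mean log loss, $\frac{1}{n} X^\top (p - y)$, and starts from zeros instead of a random point, which is fine for a convex loss. The names `gradient_descent`, `alpha`, and `n_iter` are choices made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    # Clip to keep the logarithm finite
    p = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def gradient_descent(X, y, alpha=0.1, n_iter=500):
    theta = np.zeros(X.shape[1])            # starting point
    history = []
    for _ in range(n_iter):
        p = sigmoid(X @ theta)              # current predicted probabilities
        grad = X.T @ (p - y) / len(y)       # gradient of the mean log loss
        theta = theta - alpha * grad        # step against the gradient
        history.append(log_loss(theta, X, y))
    return theta, history

# Simulated data: one predictor, true log-odds = -0.5 + 2.0 * x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
y = (rng.random(200) < sigmoid(-0.5 + 2.0 * x1)).astype(float)
X = np.column_stack([np.ones(200), x1])     # first column is the intercept

theta, history = gradient_descent(X, y)
print("Estimated theta:", np.round(theta, 3))
print(f"Loss after first step: {history[0]:.4f}, after last step: {history[-1]:.4f}")
```

With this small learning rate the loss decreases at every iteration. Trying a much larger `alpha` is a quick way to see the overshooting behaviour described above.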

#### Derivation of Cost Function
Differentiate this cost function w.r.t. our parameters. We first need the derivative of the sigmoid function:

$\frac{d\sigma(x)}{dx} = \frac{d}{dx} \left( \frac{1}{1+e^{-x}} \right) = \frac{d}{dx} \left( 1 + e^{-x} \right)^{-1} $

$\Rightarrow -\left(1 + e^{-x}\right)^{-2} \times \frac{d}{dx} \left(1 + e^{-x}\right)$

$\Rightarrow -\left(1 + e^{-x}\right)^{-2} \times \left[ 0 + \frac{d}{dx} \left(e^{-x}\right) \right]$

$\Rightarrow -\left(1 + e^{-x}\right)^{-2} \times \left[e^{-x} \times \frac{d}{dx}(-x) \right]$

$\Rightarrow -\left(1 + e^{-x}\right)^{-2} \times \left[e^{-x} \times (-1) \right]$

$\Rightarrow e^{-x} \left(1 + e^{-x}\right)^{-2}$

$\Rightarrow \frac{e^{-x}}{(1+e^{-x})^2} = \frac{e^{-x} + 1 - 1}{(1+e^{-x})(1+e^{-x})}$

$\Rightarrow \frac{(1+e^{-x}) - 1}{(1+e^{-x})(1+e^{-x})} = \frac{1}{(1+e^{-x})} \left[ \frac{(1+e^{-x})}{(1+e^{-x})} - \frac{1}{(1+e^{-x})} \right]$

$\Rightarrow \frac{1}{(1+e^{-x})} \left[ 1 - \frac{1}{(1+e^{-x})} \right] = \sigma(x)\left[1 - \sigma(x)\right]$
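
The result $\sigma'(x) = \sigma(x)[1 - \sigma(x)]$ can be checked numerically. The sketch below compares it with a central finite-difference approximation. The step size `h` and the grid of test points are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
h = 1e-5

# Central finite-difference approximation of the derivative
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

# Closed form from the derivation: sigma(x) * (1 - sigma(x))
analytic = sigmoid(x) * (1 - sigmoid(x))

max_err = float(np.max(np.abs(numeric - analytic)))
print("Largest absolute difference:", max_err)
print("Derivative at x = 0:", analytic[5])
```

The derivative peaks at 0.25 when x = 0 and shrinks toward 0 in both tails, which is why the sigmoid is steepest around the decision boundary.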

Differentiate the cost function with the help of the chain rule, which allows us to calculate complex partial derivatives by breaking them down into simpler ones.

**Step-1: Use chain rule and break the partial derivative of log-likelihood**

$-\frac{\partial LL(\theta)}{\partial \theta_j} = -\frac{\partial LL(\theta)}{\partial p} \cdot \frac{\partial p}{\partial \theta_j} \quad$
$\text{where } p= \sigma\left[\theta^\top x\right]$

$= -\frac{\partial LL(\theta)}{\partial p} \cdot \frac{\partial p}{\partial z} \cdot \frac{\partial z}{\partial \theta_j} \quad $
$\text{where } z =\theta^\top x$

**Step-2: Find derivative of log-likelihood w.r.t p**

We know,

$LL(\theta) = y \log(p) + (1-y)\log(1-p) \quad \text{where } p = \sigma\left[\theta^\top x\right]$

$\frac{\partial LL(\theta)}{\partial p} = \frac{y}{p} - \frac{(1-y)}{(1-p)}$

**Step-3: Find derivative of ‘p’ w.r.t ‘z’**

$ p= \sigma(z)$

$\frac{\partial p}{\partial z} = \frac{\partial[ \sigma (z)]}{\partial z}$

We know the derivative of sigmoid function is $\sigma[\theta^\top x][1 - \sigma(\theta^\top x)]$

$\Rightarrow \frac{\partial p}{\partial z} =  \sigma [z][1 - \sigma(z)]$

**Step-4: Find derivative of z w.r.t θ**

$ z=\theta^\top x$

$\frac{\partial z}{\partial \theta_j} = x_j$

**Step-5: Put all the derivatives in equation 1**
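
Assuming "equation 1" refers to the chain-rule expansion from Step-1, substituting the three derivatives (with $\frac{\partial LL(\theta)}{\partial p} = \frac{y}{p} - \frac{1-y}{1-p}$) gives:

$-\frac{\partial LL(\theta)}{\partial \theta_j} = -\left[\frac{y}{p} - \frac{1-y}{1-p}\right] \cdot p(1-p) \cdot x_j$

$= -\left[y(1-p) - (1-y)p\right] x_j$

$= -(y - p)\, x_j = (p - y)\, x_j \quad \text{where } p = \sigma\left[\theta^\top x\right]$

Under this result, a gradient descent step with learning rate "alpha" would update each parameter as $\theta_j := \theta_j - \alpha (p - y) x_j$.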

In [None]:
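from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score

# Predicted probabilities from the fitted model, thresholded at 0.5 to get class labels
y_prob = model.predict(X)
y_pred = (y_prob > 0.5).astype(int)
accuracy = accuracy_score(y, y_pred)
print(f"Accuracy: {accuracy}")

In [None]:
# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y, y_pred)
print(cm)

In [None]:
# ROC curve and AUC are computed from the probabilities, not the thresholded labels
fpr, tpr, _ = roc_curve(y, y_prob)
auc = roc_auc_score(y, y_prob)
print(f"AUC: {auc}")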

**Understanding Factors Significantly Influencing Readmission**

1. Use p-values from the logistic regression summary:
- Predictors with $𝑝< 0.05$ are statistically significant.
2. Assess the odds ratios:
- For example, if the odds ratio for LOS is 2.0, each additional day in the hospital doubles the odds of readmission.
3. Visualize relationships:
- Plot odds ratios for key predictors to present to stakeholders.

**Statistical Hypothesis Testing**

Example 1: Relationship Between LOS and Readmission
- Hypotheses:
    - $𝐻_0$: LOS has no effect on readmission.
    - $𝐻_𝑎$: LOS has a significant effect on readmission.
- Approach: Fit a logistic regression model and check the p-value of the LOS coefficient.

Example 2: Age Group vs. Readmission
- Hypotheses:
    - $𝐻_0$: Age group is independent of readmission.
    - $𝐻_𝑎$: Age group and readmission are dependent.
- Approach: Use a Chi-Square test of independence (see previous example).

**Actionable Insights**
- Highlight key factors significantly influencing readmission (e.g., LOS, medication adherence).
- Use odds ratios to explain how much each factor increases or decreases the likelihood of readmission.
- Present findings visually (e.g., bar charts for odds ratios, ROC curves for model performance).

# Variables and Variable Selection

Learn how to:

- Differentiate between Variable Types and Dummy Variables;
- Select features based on correlation;
- Select features based on variance thresholds.

### Introduction

**Variables** are the basic building blocks of datasets. 
- The quality of the variables in your dataset directly affects the interpretability and overall outcome of your machine learning model. 

**Variable selection** and in-depth knowledge of the domain you are modelling remain essential when developing a predictive model.

The purpose of regression is essentially to build associations between multiple variables. 
- Variable selection involves eliminating input variables, which can
    - reduce the computational cost of modeling, and
    - improve the performance of the model. 

The model is structured around the belief that one of the variables in our dataset is a dependent variable (DV) that is explained or predicted in some way by the other, independent variables (IVs). In this sense we work with: 

**Input variables** - the independent variables (IVs), used to explain or predict the target variable.

**Target variable** - the dependent variable (DV), the variable you want to predict.

In [None]:
# Column names contain white space; replace it with underscores so the names can be used as variable names (e.g. in formula strings) later on
df.columns = [col.replace(" ","_") for col in df.columns] 
df.head()

##### Perform preliminary data preprocessing

To build relationships between the input variables and the dependent variable, we only want to consider instances (customers) with the positive outcome, i.e. those who actually took a personal loan. We therefore keep only those rows and then drop the now-constant `Personal_Loan` column:

In [None]:
df = df[df['Personal_Loan'] == 1]
df = df.drop(['Personal_Loan'],axis=1)
df.head()

##### Variable types

`df.info()` outputs the number of non-null entries in each column. 
- If columns have differing numbers of non-null entries, the data contains missing values.

`df.describe()` shows the summary statistics of the data.

In [None]:
df.info()

df.describe()

### Dummy Variables in Regression Models
Dummy variables are artificial variables created to represent categorical data (data that can take on a limited number of distinct values) in a numerical format that a regression model can process. 

Because regression models work with numerical data, dummy variables allow us to include categorical variables in the analysis by converting categories into binary values (0 or 1).

Why Use Dummy Variables?
- Regression models require numerical inputs to calculate relationships between variables. 
- Dummy variables serve as a bridge between categorical predictors (e.g., gender, region, or treatment group) and the mathematical structure of regression models. 
    - By using dummy variables, you can incorporate the effects of categorical predictors into the model while preserving interpretability.

#### Dummy Variable Encoding
The summary statistics of our numerically encoded categorical columns ('Online', 'CD_Account', 'Securities_Account') give little to no useful information.

NB: All input data for regression model building needs to be numerical. 

We therefore need to transform the text data (found in columns such as 'Education', 'Gender', and 'Area') into numbers before we can train our machine learning model.

##### **Using qualitative variables with more than two levels in multiple regression**
Including such variables in a regression requires transforming them into a numerical format that the model can understand. This is typically achieved through dummy coding, which creates a binary variable for each category other than the reference category.

Qualitative Variables with More than Two Levels:
- Examples:
    - Smoking Status: ["Non-Smoker", "Current Smoker", "Former Smoker"]
    - Region: ["Urban", "Suburban", "Rural"]
    - Blood Type: ["A", "B", "AB", "O"]
- These are categorical variables that don’t have a natural numeric relationship or order.

Dummy Variables:
- For a categorical variable with k levels, we create k−1 dummy variables.
- One category is used as the reference category (baseline), and the others are compared against it.

Including Dummy Variables in Regression:
- The regression equation incorporates these dummy variables as predictors.
- The coefficients for the dummy variables represent the difference in the dependent variable (outcome) between each category and the reference category.

##### Create Dummy Variables
The process involves:
1. Identify the Categorical Variable: 
- Select the categorical variable you want to include in your model (e.g., "Region" with categories: "North," "South," "East," "West").
2. Choose a Reference Category: 
- Pick one of the categories to act as the reference group. 
- The reference group will not have a separate dummy variable; its effect is captured in the intercept.
3. Create Dummy Variables: 
- Create one binary variable for each of the remaining categories. Each variable will take the value:
    - 1: if the observation belongs to that category.
    - 0: otherwise.
- To facilitate this transformation from textual-categorical data to numerical equivalents, 
    - use a pandas method called `get_dummies`. 
- The text data are categorical variables, and get_dummies will transform all the categorical text data into numbers by adding a column for each distinct category. 
    - The new column has a 
        - 1 for observations which were in this category, and a 
        - 0 for observations that were not.

For example, the dataframe:

| Dog Age | Breed      |
|---------|------------|
| 15      | "Bulldog"  |
| 12      | "Labrador" |
| 10      | "Labrador" |
| 22      | "Beagle"   |
| 9       | "Labrador" |


After `pd.get_dummies` it becomes:

| Dog Age | Breed_Beagle | Breed_Bulldog | Breed_Labrador |
|---------|--------------|---------------|----------------|
| 15      | 0            | 1             | 0              |
| 12      | 0            | 0             | 1              |
| 10      | 0            | 0             | 1              |
| 22      | 1            | 0             | 0              |
| 9       | 0            | 0             | 1              |

This process is known as dummy variable encoding.
- It is an important step in preprocessing data for regression analysis.
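
The dog example can be reproduced with a short sketch. The variable names (`dogs`, `encoded`, `encoded_ref`) are chosen for the illustration; `dtype=int` is passed because recent pandas versions otherwise return `True`/`False` columns.

```python
import pandas as pd

dogs = pd.DataFrame({
    "Dog Age": [15, 12, 10, 22, 9],
    "Breed": ["Bulldog", "Labrador", "Labrador", "Beagle", "Labrador"],
})

# One 0/1 column per breed; pandas orders the new columns alphabetically
encoded = pd.get_dummies(dogs, columns=["Breed"], dtype=int)
print(encoded)

# k-1 encoding for regression: drop one category to act as the reference
encoded_ref = pd.get_dummies(dogs, columns=["Breed"], dtype=int, drop_first=True)
print(encoded_ref.columns.tolist())
```

Note that `drop_first=True` always drops the alphabetically first category (here "Beagle"), which may not be the reference group you want.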

##### Application in Regression
Example 1: A Simple Model
- Scenario:
    - We want to model patient BMI (body mass index) as a function of Smoking Status (categorical) and Age (continuous).
    - Smoking Status has three levels: ["Non-Smoker", "Current Smoker", "Former Smoker"].

1. Define categories: "Non-Smoker," "Current Smoker," "Former Smoker."
2. Choose a reference group: "Non-Smoker."
3. Create dummy variables:
- Create k−1= 3−1= 2 dummy variables:
    - CurrentSmoker: 1 if "Current Smoker," 0 otherwise.
    - FormerSmoker: 1 if "Former Smoker," 0 otherwise.

With smoking status as the only predictor, the regression equation is:

$$BMI = 𝛽_0 + 𝛽_1 \cdot CurrentSmoker + 𝛽_2 \cdot FormerSmoker + \epsilon $$

Adding Age as a continuous predictor, it becomes:

$$BMI = 𝛽_0 + 𝛽_1 \cdot Age + 𝛽_2 \cdot CurrentSmoker + 𝛽_3 \cdot FormerSmoker + \epsilon $$

- Where (for the model that includes Age):
    - $𝛽_0$ is the predicted BMI for a non-smoker (the reference category) when age = 0.
    - $𝛽_1$: Effect of age on BMI, holding smoking status constant.
    - $𝛽_2$ is the difference in BMI between current smokers and non-smokers, holding age constant.
    - $𝛽_3$ is the difference in BMI between former smokers and non-smokers, holding age constant.

Interpret Coefficients:
- $𝛽_2$ > 0: Current smokers have higher BMI than non-smokers.
- $𝛽_3$ < 0: Former smokers have lower BMI than non-smokers.

In [None]:
import pandas as pd
import statsmodels.api as sm

# Sample dataset
data = {
    'Age': [25, 40, 35, 50, 60, 45],
    'SmokingStatus': ['Non-Smoker', 'Current Smoker', 'Former Smoker', 
                      'Non-Smoker', 'Current Smoker', 'Former Smoker'],
    'BMI': [22.5, 27.5, 24.0, 23.0, 28.0, 25.0]
}

# Create DataFrame
df = pd.DataFrame(data)

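# Step 1: Create 0/1 dummy variables.
# drop_first=True would drop 'Current Smoker' (alphabetically first), so we
# create all three columns and drop the reference category explicitly.
df = pd.get_dummies(df, columns=['SmokingStatus'], dtype=int)
df = df.drop(columns=['SmokingStatus_Non-Smoker'])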

# Inspect the transformed dataset
print(df)

# Step 2: Define predictors and target variable
X = df[['Age', 'SmokingStatus_Current Smoker', 'SmokingStatus_Former Smoker']]
y = df['BMI']

# Add constant (intercept)
X = sm.add_constant(X)

# Step 3: Fit the regression model
model = sm.OLS(y, X).fit()

# Step 4: Summary of regression results
print(model.summary())

# Illustrative output layout (values are for explanation only, not the fit of the six rows above):
# ==============================================================================
#                  Coef.    Std Err    t      P>|t|    [95% Conf. Interval]
# ------------------------------------------------------------------------------
# const           21.500    2.000     10.750  0.001    [16.000, 27.000]
# Age              0.200    0.050     4.000   0.015    [ 0.075,  0.325]
# SmokingStatus_CurrentSmoker    2.500    0.900     2.778   0.043    [ 0.050,  4.950]
# SmokingStatus_FormerSmoker     1.800    0.950     1.895   0.080    [-0.400,  4.000]
# ==============================================================================

Intercept ($𝛽_0$)
- The predicted BMI for a non-smoker when age = 0 (reference group). Here, 21.5.

Age $𝛽_1$
- The change in BMI for each additional year of age, holding smoking status constant
- For each additional year of age, BMI increases by 0.2 units, regardless of smoking status.

Smoking Status Variables ($𝛽_2$,$𝛽_3$)
- SmokingStatus_CurrentSmoker $𝛽_2$ The average difference in BMI between current smokers and non-smokers of the same age.
    - Current smokers, on average, have a BMI 2.5 units higher than non-smokers, holding age constant.
- SmokingStatus_FormerSmoker $𝛽_3$ The average difference in BMI between former smokers and non-smokers of the same age.
    - Former smokers, on average, have a BMI 1.8 units higher than non-smokers, but this difference is not statistically significant ($p = 0.080 > 0.05$).

Important Considerations
- Choosing the Reference Category:
    - Choose a category that makes interpretation easiest or is clinically relevant (e.g., "Non-Smoker").
    - Changing the reference category will change the coefficients, but the overall model fit stays the same.
- Collinearity:
    - Creating only k−1 dummy variables ensures the dummies do not sum to 1 in every row (which would duplicate the intercept column), avoiding perfect multicollinearity.
- Interaction Terms:
    - If you suspect that the relationship between age and BMI depends on smoking status, add interaction terms (e.g., Age × SmokingStatus).

Applications in Healthcare
- Treatment Type: ["Placebo", "Drug A", "Drug B"]
- Disease Stage: ["Mild", "Moderate", "Severe"]
- Hospital Region: ["North", "South", "East", "West"]

Dummy variables allow you to measure the effects of these categorical predictors while controlling for other factors.

In [None]:
import pandas as pd
import statsmodels.api as sm

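# Sample data (illustrative values; eight rows so the model keeps residual degrees of freedom)
data = {
    'Age': [45, 34, 50, 29, 41, 38, 55, 31],
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'BMI': [25.0, 27.5, 24.3, 22.8, 25.6, 26.9, 24.9, 23.1]
}

# Create DataFrame
df = pd.DataFrame(data)

# Create 0/1 dummy variables for the 'Region' column.
# drop_first=True would drop 'East' (alphabetically first), so drop the
# chosen reference category, 'North', explicitly instead.
df_dummies = pd.get_dummies(df, columns=['Region'], dtype=int)
df_dummies = df_dummies.drop(columns=['Region_North'])

# Inspect the data
print(df_dummies)

# Regression model
X = df_dummies[['Age', 'Region_South', 'Region_East', 'Region_West']]  # Predictors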
y = df_dummies['BMI']  # Target variable

# Add constant for intercept
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(y, X).fit()

# Summary of regression results
print(model.summary())


### Additivity and Linearity Assumptions in Linear Regression
Linear regression models rely on several key assumptions, including additivity and linearity. These ensure that the model captures relationships between variables appropriately.

##### **1. Additive Assumption**
The additive assumption means that the effect of each predictor variable on the dependent variable (outcome) is independent of the other predictors. In other words:
- The predictors’ effects combine additively to determine the outcome.
- There are no interactions or dependencies between predictor variables unless explicitly modeled.

Mathematical Representation

In a multiple linear regression model:
$$ Y = 𝛽_0 + 𝛽_1 X_1 + 𝛽_2 X_2 + \dots + 𝛽_n X_n +\epsilon $$
- Each predictor ($X_i$) contributes independently to the dependent variable (Y).
- The change in Y due to $X_1$ is unaffected by the values of the other predictors ($X_2, \dots, X_n$).

Violation of Additivity
- If the predictors interact with each other, the additive assumption is violated.
$$ y = 𝛽_0 + 𝛽_1 X_1 + 𝛽_2 X_2 + 𝛽_3 X_1 X_2 +\epsilon $$
- Here,
    - the `interaction term` $X_1 X_2$ indicates that the effect of  $X_1$ on Y depends on the value of $X_2$, breaking the additivity assumption.

##### **2. Linearity Assumption**
The linearity assumption means that the relationship between the dependent variable (Y) and each predictor ($X_i$) is linear. This implies that:
- The change in Y is proportional to the change in $X_i$, holding other predictors constant.
- The regression equation captures this relationship as a straight line (or hyperplane in higher dimensions).

Mathematical Representation

For a single predictor:
$$ Y = 𝛽_0 + 𝛽_1 X_1 +\epsilon $$
- The dependent variable Y is a linear function of $X_1$.
- $β_1$ is the slope, indicating the constant rate of change in Y for a one-unit change in $X_1$.

For multiple predictors:
$$ Y = 𝛽_0 + 𝛽_1 X_1 + 𝛽_2 X_2 + \dots + 𝛽_n X_n +\epsilon $$
- Each predictor has a linear relationship with Y.

##### Checking Additivity and Linearity in Practice
Additivity:
- Use interaction terms to test for non-additive effects.
    - Example: Add $X_1 \times X_2$ to the model to test whether the relationship between $X_1$ and Y depends on $X_2$.
- Plot residuals against predictor variables to detect patterns that might indicate interaction effects.

Linearity:
- Residual plots:
    - Plot residuals vs. predicted values or predictor variables.
    - If the residuals show a non-random pattern (e.g., curvature), the linearity assumption might be violated.
- Polynomial regression or non-linear transformations:
    - Add quadratic or higher-order terms ($X^2, X^3$) to account for non-linear relationships.
- Scatterplots:
    - Visualize Y vs. $X_i$ to check if the relationship appears linear.

##### Consequences of Violating Additivity or Linearity
Violating Additivity:
- The model may produce biased coefficients because it doesn’t account for interactions between variables.
- Predictions will be less accurate since the relationship between predictors isn’t properly captured.

Violating Linearity:
- The model cannot fit the data well, leading to high residual variance and low predictive accuracy.
- Inference (e.g., significance of predictors) becomes unreliable.

Key Considerations
- Avoid the Dummy Variable Trap: The dummy variable trap occurs when all categories are included as dummy variables, resulting in perfect multicollinearity. To avoid this, always exclude one category (the reference group).
- Interpretation: Regression coefficients of dummy variables represent the change in the dependent variable (e.g., BMI) relative to the reference group.
- Interaction Terms: You can include interaction terms between dummy variables and continuous variables (e.g., Age × Region) if you suspect the relationship between the independent variables and the dependent variable varies across groups.
- Standardization: While dummy variables themselves don’t need to be standardized, you may want to standardize continuous variables in the presence of dummy variables to ensure all predictors are on comparable scales.
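
The dummy variable trap can be made concrete by checking the rank of the design matrix. This is a sketch with a hypothetical six-row `Region` column and a helper `design_matrix` introduced only for the illustration.

```python
import numpy as np
import pandas as pd

region = pd.Series(["North", "South", "East", "West", "North", "South"], name="Region")

all_dummies = pd.get_dummies(region, dtype=int)                   # k columns
ref_dummies = pd.get_dummies(region, dtype=int, drop_first=True)  # k-1 columns

def design_matrix(dummies):
    """Prepend an intercept column of ones to the dummy columns."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

X_trap = design_matrix(all_dummies)
X_ok = design_matrix(ref_dummies)

print(X_trap.shape[1], np.linalg.matrix_rank(X_trap))  # 5 columns, rank 4
print(X_ok.shape[1], np.linalg.matrix_rank(X_ok))      # 4 columns, rank 4
```

With all four dummies the columns sum to the intercept column, so the matrix has more columns than its rank and the coefficients are not uniquely determined; dropping one category restores full column rank.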

Examples

Additivity Violation:
- Imagine you’re studying the effect of exercise and diet on weight loss:
- Additive assumption: The effects of exercise and diet are independent
$$ WeightLoss = 𝛽_0 + 𝛽_1 (Exercise) + 𝛽_2 (Diet) + \epsilon $$
- If the effectiveness of exercise depends on the type of diet, an interaction term (Exercise × Diet) should be added:
$$ WeightLoss = 𝛽_0 + 𝛽_1 (Exercise) + 𝛽_2 (Diet) +  𝛽_3 (Exercise \times Diet) + \epsilon $$

Linearity Violation:
- Suppose you’re modeling blood pressure as a function of age:
- Linear relationship: 
$$ BP = 𝛽_0 + 𝛽_1 (Age) +  \epsilon $$
- If blood pressure increases non-linearly with age (e.g., faster in older individuals), use a transformation:
$$ BP = 𝛽_0 + 𝛽_1 (Age) + 𝛽_2 (Age^2) + \epsilon $$

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import numpy as np

# Simulated data
np.random.seed(42)
X = np.random.uniform(0, 10, 100)
Y = 2 * X + 5 + np.random.normal(0, 1, 100)

# Scatter plot to check linearity
sns.scatterplot(x=X, y=Y)
plt.title("Scatterplot of Y vs. X")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()

# Fit a linear regression model
X_with_const = sm.add_constant(X)
model = sm.OLS(Y, X_with_const).fit()

# Residual plot
residuals = model.resid
sns.residplot(x=X, y=residuals, lowess=True)
plt.title("Residual Plot")
plt.xlabel("X")
plt.ylabel("Residuals")
plt.axhline(0, color='red', linestyle='--')
plt.show()


#####  **Understanding Interaction Terms**
Let's look at interaction terms and how to implement them.

An interaction term in regression captures the combined effect of two variables on the dependent variable. 
It’s useful when you suspect that the relationship between a predictor (e.g., age) and the outcome (e.g., BMI) depends on the level of another variable (e.g., gender or region).

For example:
- In healthcare, the effect of age on BMI might differ by region (urban vs. rural).

Interaction terms allow the regression model to capture this varying effect.

Mathematically, an interaction term for two predictors $X_1$ and $X_2$ is the product $X_1 \times X_2$
- The model equation becomes:
$$ y = 𝛽_0 + 𝛽_1 X_1 + 𝛽_2 X_2 + 𝛽_3 (X_1 \cdot X_2) +\epsilon $$

- Here 
    - $𝛽_3$ measures how the effect of $X_1$ on y changes as $X_2$ changes. 

**When should an interaction term be added to a multiple regression model?**
- When the effect of one predictor variable on the dependent variable depends on the value of another predictor variable. 
    - This allows the model to capture relationships between variables that are non-additive.

1. Suspected Interaction Between Variables
- If there’s theoretical or domain knowledge suggesting that the relationship between one predictor and the outcome is influenced by another predictor.
- Example: In a study on weight loss:
    - Exercise might have a stronger effect on weight loss when combined with a healthy diet than when diet is poor.

2. Observed Non-Additive Patterns
- If exploratory data analysis reveals that the effect of one variable changes at different levels of another variable.
- Example: A scatterplot or heatmap shows that the effect of marketing spend on sales depends on product price.

3. Model Diagnostics
- If residual plots or other diagnostics indicate that the model fails to capture the relationship properly, interaction terms might be necessary.

4. Testing for Moderating Effects
- If you want to test whether a predictor moderates the relationship between another predictor and the outcome.
    - Example: Does the effect of education on income vary by gender?

Mathematical Formulation
- An interaction term is a product of two predictors in a regression model.
    - For predictors $X_1$ and $X_2$, the interaction term is $X_1 \times X_2$. The regression equation becomes:
$$ y = 𝛽_0 + 𝛽_1 X_1 + 𝛽_2 X_2 + 𝛽_3 (X_1 \cdot X_2) +\epsilon $$
- $𝛽_3$: The coefficient of the interaction term, which quantifies how the effect of $X_1$ on Y changes for different values of $X_2$.

**Interpreting Interaction Terms**

Without Interaction Term:
- The effect of $X_1$ on Y is constant, regardless of $X_2$.
$$ y = 𝛽_0 + 𝛽_1 X_1 + 𝛽_2 X_2  +\epsilon $$

With Interaction Term:
- The effect of $X_1$ on Y depends on $X_2$.
$$ y = 𝛽_0 + 𝛽_1 X_1 + 𝛽_2 X_2 + 𝛽_3 (X_1 \cdot X_2) +\epsilon $$
- The effect of $X_1$ on Y: $\frac{\partial Y}{\partial X_1} = 𝛽_1 + 𝛽_3 X_2 $.
- The effect of $X_2$ on Y: $\frac{\partial Y}{\partial X_2} = 𝛽_2 + 𝛽_3 X_1 $.
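
The marginal-effect formula $𝛽_1 + 𝛽_3 X_2$ can be evaluated from a fitted model. The sketch below uses simulated data with assumed true coefficients ($𝛽_1 = 2$, $𝛽_3 = 0.5$) and a hypothetical DataFrame `df_sim`.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(2)
n = 300
df_sim = pd.DataFrame({"X1": rng.uniform(0, 10, n), "X2": rng.uniform(0, 5, n)})

# Simulated outcome with beta_1 = 2, beta_2 = 1, beta_3 = 0.5
df_sim["Y"] = (1 + 2 * df_sim["X1"] + 1 * df_sim["X2"]
               + 0.5 * df_sim["X1"] * df_sim["X2"] + rng.normal(0, 1, n))

fit = ols("Y ~ X1 * X2", data=df_sim).fit()
b1, b3 = fit.params["X1"], fit.params["X1:X2"]

# Effect of a one-unit increase in X1, evaluated at several values of X2
for x2 in (0, 2, 4):
    print(f"X2 = {x2}: effect of X1 on Y = {b1 + b3 * x2:.2f}")
```

The printed effect grows with `X2`, which is exactly what a positive interaction coefficient means.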

Example - Healthcare Outcomes:
- Predictors: Age ($X_1$) and Treatment Type ($X_2$).
- Outcome: Recovery Rate (Y).
- Interaction: The effect of treatment might vary by age.
    - Certain treatments could work better for younger patients.

**When NOT to Add Interaction Terms**

No Theoretical Basis:
- Avoid adding interaction terms just because you can. Ensure there’s a plausible reason for the interaction.

Small Dataset:
- Adding interaction terms increases the number of parameters, which can lead to overfitting in small datasets.

Multicollinearity:
- Interaction terms are often correlated with their constituent predictors, potentially increasing multicollinearity. Use VIF (Variance Inflation Factor) to check for multicollinearity.

___________

Example Scenario in a Medical Dataset

Problem Statement:
- We want to study BMI based on:
    - Age (continuous variable),
    - Smoking Status (categorical: "Non-Smoker," "Current Smoker," "Former Smoker"),
    - Interaction between Age and Smoking Status.

We suspect that the effect of age on BMI is different for smokers compared to non-smokers.

In [None]:
import pandas as pd
import statsmodels.api as sm

# Sample dataset
data = {
    'Age': [45, 34, 50, 29, 60, 40],
    'SmokingStatus': ['Non-Smoker', 'Current Smoker', 'Former Smoker', 
                      'Non-Smoker', 'Current Smoker', 'Former Smoker'],
    'BMI': [25.0, 27.5, 24.3, 22.8, 28.7, 26.4]
}

# Create DataFrame
df = pd.DataFrame(data)

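# Step 1: Create 0/1 dummy variables for 'SmokingStatus'.
# drop_first=True would drop 'Current Smoker' (alphabetically first), so we
# create all three columns and drop the reference category explicitly.
df = pd.get_dummies(df, columns=['SmokingStatus'], dtype=int)
df = df.drop(columns=['SmokingStatus_Non-Smoker'])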

# Step 2: Create interaction terms (Age × SmokingStatus)
df['Age_CurrentSmoker'] = df['Age'] * df['SmokingStatus_Current Smoker']
df['Age_FormerSmoker'] = df['Age'] * df['SmokingStatus_Former Smoker']

# Inspect the dataset
print(df)

# Step 3: Define predictors and target variable
X = df[['Age', 'SmokingStatus_Current Smoker', 'SmokingStatus_Former Smoker', 
        'Age_CurrentSmoker', 'Age_FormerSmoker']]
y = df['BMI']

# Add constant (intercept) to the model
X = sm.add_constant(X)

# Step 4: Fit the regression model
model = sm.OLS(y, X).fit()

# Step 5: Summary of regression results
print(model.summary())


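# Illustrative output layout (values are for explanation only; the six-row sample
# has as many coefficients as observations, so its own fit leaves no residual degrees of freedom):
#==============================================================================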
#                 Coef.    Std Err    t      P>|t|    [95% Conf. Interval]
#------------------------------------------------------------------------------
#const           22.500    2.500     9.000   0.001    [16.000, 29.000]
#Age              0.200    0.050     4.000   0.015    [ 0.075,  0.325]
#SmokingStatus_CurrentSmoker   2.500    1.000     2.500   0.050    [ 0.000,  5.000]
#SmokingStatus_FormerSmoker    1.800    1.200     1.500   0.180    [-1.000,  4.600]
#Age_CurrentSmoker             0.050    0.020     2.500   0.050    [ 0.000,  0.100]
#Age_FormerSmoker              0.030    0.025     1.200   0.300    [-0.020,  0.080]
#==============================================================================


##### Last Step: Interpreting the Results

Intercept ($𝛽_0$)
- The baseline BMI for a 0-year-old non-smoker (reference group).

Age $𝛽_1$
- The change in BMI for each additional year of age among non-smokers (the reference group); the slopes for the other groups add the interaction coefficients below.

Smoking Status Variables ($𝛽_2$,$𝛽_3$)
- $𝛽_2$ The average difference in BMI between current smokers and non-smokers (when Age = 0).
- $𝛽_3$ The average difference in BMI between former smokers and non-smokers (when Age = 0).

Interaction Terms ($𝛽_4$, $𝛽_5$)
- $𝛽_4$ The additional change in BMI per year of age for current smokers compared to non-smokers.
- $𝛽_5$ The additional change in BMI per year of age for former smokers compared to non-smokers.

Example Insights:
- If the Age × SmokingStatus_Current Smoker interaction term ($𝛽_4$) is statistically significant (e.g., $p<0.05$), 
    - it indicates that the effect of age on BMI is significantly different for current smokers compared to non-smokers.
- If $𝛽_4$ > 0, BMI increases more steeply with age for current smokers than for non-smokers.
- If $𝛽_4$ < 0, BMI increases less steeply with age for current smokers than for non-smokers.
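
Using the coefficients from the illustrative summary table above, the age slope for each smoking group follows by adding the relevant interaction coefficient to $𝛽_1$. The variable names are chosen for this sketch.

```python
# Coefficients copied from the illustrative summary table
beta_age = 0.200          # age slope for the reference group (non-smokers)
beta_age_current = 0.050  # Age x Current Smoker interaction
beta_age_former = 0.030   # Age x Former Smoker interaction

slopes = {
    "Non-Smoker": beta_age,
    "Current Smoker": beta_age + beta_age_current,
    "Former Smoker": beta_age + beta_age_former,
}

for group, slope in slopes.items():
    print(f"{group}: BMI changes by {slope:.3f} per year of age")
```

These group-specific slopes are the lines one would draw in a plot of BMI against age by smoking group.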

Note
- Visualization: Use visual tools (e.g., line plots) to illustrate how BMI changes with age across smoking groups.
- Statistical Significance: Focus on p-values to assess whether interactions are meaningful.
- Real-World Use Case: Interaction terms are vital in clinical studies where the combined effect of variables (e.g., medication × age) is of interest.


##### Adding Interaction Terms in Python
- The following example shows how to include interaction terms in a regression model using Python:

In [None]:
import pandas as pd
from statsmodels.formula.api import ols

# Simulated data (Price is deliberately not a multiple of MarketingSpend,
# otherwise the two predictors would be perfectly collinear)
data = pd.DataFrame({
    'MarketingSpend': [10, 20, 30, 40, 50],
    'Price': [1, 3, 2, 5, 4],
    'Sales': [15, 25, 35, 50, 60]
})

# Manual interaction column, shown for inspection only;
# the formula term MarketingSpend:Price builds the same product internally
data['Interaction'] = data['MarketingSpend'] * data['Price']

# Fit regression model with interaction
model = ols('Sales ~ MarketingSpend + Price + MarketingSpend:Price', data=data).fit()

# Summary
print(model.summary())


Using Patsy Syntax

The formula interface also offers a shorthand for adding interaction terms:
- The `*` operator automatically includes the main effects and the interaction term, so `MarketingSpend * Price` expands to `MarketingSpend + Price + MarketingSpend:Price`.

In [None]:
model = ols('Sales ~ MarketingSpend * Price', data=data).fit()
print(model.summary())

##### Detecting Interaction Effects
1. Visual Inspection:
- Use scatterplots, heatmaps, or line plots to observe how the relationship between $𝑋_1$ and 𝑌 changes at different levels of $𝑋_2$.

2. Statistical Testing:
- Include interaction terms in your regression model and check their significance (p-value of the interaction term).


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

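# Split Price at its median so each hue group has several points;
# one regression line is then drawn per price band
data['PriceBand'] = (data['Price'] > data['Price'].median()).map({False: 'Low', True: 'High'})

sns.lmplot(data=data, x='MarketingSpend', y='Sales', hue='PriceBand', palette='coolwarm', ci=None)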
plt.title("Interaction Effect")
plt.show()




##### Example Continuation

In [None]:
# Continuing from the loan example: encode categorical columns as 0/1 integers
df_dummies = pd.get_dummies(df, dtype=int)

# Again we make sure that all the column names have underscores instead of whitespaces
df_dummies.columns = [col.replace(" ","_") for col in df_dummies.columns] 

df_dummies.head()

Correlations and Model Structure

Now, we can build a model that predicts `Loan_Size` (our dependent variable) as a function of 43 different independent variables (IVs)

1. Reorder the columns so that the dependent variable is the last column of the dataframe. 
- This makes a heatmap visualisation of the correlation matrix easier to interpret.

2. Compute the correlation matrix and plot it.

In [None]:
column_titles = [col for col in df_dummies.columns if col!= 'Loan_Size'] + ['Loan_Size']
df_dummies=df_dummies.reindex(columns=column_titles)

In [None]:
import matplotlib.pyplot as plt
from statsmodels.graphics.correlation import plot_corr

# Compute the correlation matrix once and reuse it
corr = df_dummies.corr()

# Plot the correlation matrix as a heatmap
fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(111)
plot_corr(corr, xnames=corr.columns, ax=ax);

#### Rerun your Model: Fitting the model using `statsmodels.OLS`

##### Generating the regression string

We import the statsmodels library, which has a rich set of statistical tools to help us. 

Those of you familiar with the R language will know that fitting a model there requires a formula string of the form:

`y ~ X`

- which is read as follows: "Regress y on X". 

`statsmodels` works in a similar way, so we need to generate an appropriate string to feed to the method when we wish to fit the model.

In [None]:
from statsmodels.formula.api import ols

# Model DataFrame with all of the columns:
dfm = df_dummies.copy()

# The dependent variable:
y_name = 'Loan_Size'
# The independent variables
# (let's first try all of the columns in the model DataFrame)
X_names = [col for col in dfm.columns if col != y_name]

# Build the OLS formula string " y ~ X "
formula_str = y_name+" ~ "+" + ".join(X_names);
print('Formula:\n\t {}'.format(formula_str))

In [None]:
# Fit the model using the model dataframe
model=ols(formula=formula_str, data=dfm)
fitted = model.fit()

# Output the fitted summary
print(fitted.summary())

### Interpreting the OLS Regression Summary

**Model Performance**

|Measure           |Value             |
|------------------|------------------|
| Dep. Variable:   |        Loan_Size | 
| Model:           |              OLS | 
| Method:          |    Least Squares | 
| Date:            | Sat, 02 May 2020 |
| Time:            |         13:21:01 |
| No. Observations:|              471 |
| Df Residuals:    |              430 |
| Df Model:        |               40 |
| Covariance Type: |        nonrobust |
| R-squared:       |             0.777|
| Adj. R-squared:  |             0.757|
| F-statistic:     |             37.56|
| Prob (F-statistic): |      1.71e-115|
| Log-Likelihood:  |           -1387.0|
| AIC:             |             2856.|
| BIC:             |             3026.|

Dependent Variable: Loan_Size
- The target variable being modeled, indicating the size of loans in this context.

R-squared: 0.777
- Meaning: 77.7% of the variation in Loan_Size is explained by the independent variables in the model.
- Thresholds: Higher values (closer to 1) indicate better model fit; 77.7% is a strong fit for real-world data.
- Stakeholder Message: The model is effective at explaining the variability in loan sizes based on the input variables.

Adj. R-squared: 0.757
- Meaning: Like R-squared, this measures the share of variation in Loan_Size explained by the model (75.7%), but it penalises the number of predictors, so adding uninformative variables does not inflate it. 
- A slight drop from R-squared suggests that some variables may add limited value to the model.

**Statistical Significance of the Model**

F-statistic: 37.56
- Meaning: The F-test assesses the statistical significance of the model as a whole by testing whether at least one of the predictors has a non-zero coefficient.

Prob (F-statistic): 1.71e-115 (extremely small, close to 0)
- Stakeholder Message: The overall model is statistically significant, indicating that the predictors together effectively explain variations in loan size.
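
The F-statistic can be approximately reconstructed from R-squared and the degrees of freedom. The sketch below assumes the usual overall F-test formula and uses the rounded R-squared, so it will land close to, but not exactly on, the reported 37.56.

```python
from scipy.stats import f

n, k = 471, 40           # observations and Df Model from the summary
r_squared = 0.777        # rounded value from the summary
df_resid = n - k - 1     # residual degrees of freedom (430)

# F = (explained variance per predictor) / (unexplained variance per residual df)
f_stat = (r_squared / k) / ((1 - r_squared) / df_resid)

# Upper-tail probability of the F distribution with (k, df_resid) degrees of freedom
p_value = f.sf(f_stat, k, df_resid)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```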

__________________________________________________________________________________________________________________________________________________________

**Coefficients and Their Interpretation**

|                          |   coef   | std err    |      t     | P>t    | [0.025      0.975]|
|--------------------------|----------|------------|------------|-----------|-------------------|
|Intercept                 | 6.4496   |   2.696    |  2.392     | 0.017     |  1.150      11.749|
|Age                       |-0.3140   |   0.194    | -1.620     | 0.106     | -0.695       0.067|
|Experience                | 0.2226   |   0.195    |  1.142     | 0.254     | -0.160       0.605|
|Income                    | 0.1777   |   0.008    | 23.319     | 0.000     |  0.163       0.193|
|Family                    | 1.3289   |   0.219    |  6.060     | 0.000     |  0.898       1.760|
|CCAvg                     | 1.4333   |   0.114    | 12.521     | 0.000     |  1.208       1.658|
|Mortgage                  |-0.0370   |   0.001    |-24.962     | 0.000     | -0.040      -0.034|
|Securities_Account        | 1.5816   |   0.798    |  1.982     | 0.048     |  0.013       3.150|
|CD_Account                |-0.6828   |   0.634    | -1.078     | 0.282     | -1.928       0.563|
|Online                    | 0.1235   |   0.513    |  0.241     | 0.810     | -0.886       1.133|
|Education_Postgrad        | 2.3492   |   0.941    |  2.496     | 0.013     |  0.499       4.199|
|Education_Professional    | 1.9695   |   0.968    |  2.034     | 0.043     |  0.066       3.873|
|Education_Undergrad       | 2.1309   |   0.988    |  2.156     | 0.032     |  0.188       4.074|
|Gender_Female             | 3.6759   |   1.383    |  2.658     | 0.008     |  0.958       6.394|
|Gender_Male               | 2.7737   |   1.352    |  2.052     | 0.041     |  0.117       5.431|
|Area_Alameda              |-0.0350   |   0.854    | -0.041     | 0.967     | -1.714       1.644|
|Area_Butte                |-2.9267   |   3.371    | -0.868     | 0.386     | -9.553       3.700|
|Area_Contra_Costa         |-0.1349   |   1.435    | -0.094     | 0.925     | -2.956       2.686|
|Area_Fresno               | 2.0428   |   3.397    |  0.601     | 0.548     | -4.634       8.719|
|Area_Humboldt             | 0.0294   |   3.371    |  0.009     | 0.993     | -6.596       6.655|
|Area_Kern                 | 1.1313   |   1.830    |  0.618     | 0.537     | -2.465       4.727|
|Area_Los_Angeles          |-0.2556   |   0.653    | -0.392     | 0.696     | -1.538       1.027|
|Area_Marin                | 0.2734   |   1.969    |  0.139     | 0.890     | -3.596       4.143|
|Area_Mendocino            | 4.0507   |   4.756    |  0.852     | 0.395     | -5.297      13.398|
|Area_Monterey             |-2.3811   |   1.289    | -1.847     | 0.065     | -4.914       0.152|
|Area_Orange               | 0.5804   |   1.005    |  0.578     | 0.564     | -1.395       2.556|
|Area_Placer               |-0.1183   |   3.351    | -0.035     | 0.972     | -6.706       6.469|
|Area_Riverside            |-0.4246   |   1.991    | -0.213     | 0.831     | -4.339       3.489|
|Area_Sacramento           | 0.9005   |   1.310    |  0.687     | 0.492     | -1.675       3.476|
|Area_San_Bernardino       | 2.3827   |   2.770    |  0.860     | 0.390     | -3.062       7.827|
|Area_San_Diego            | 0.4737   |   0.767    |  0.618     | 0.537     | -1.034       1.981|
|Area_San_Francisco        |-1.4785   |   1.173    | -1.260     | 0.208     | -3.785       0.828|
|Area_San_Joaquin          | 1.1931   |   4.742    |  0.252     | 0.801     | -8.128      10.514|
|Area_San_Luis_Obispo      |-0.2345   |   2.408    | -0.097     | 0.922     | -4.968       4.499|
|Area_San_Mateo            | 0.6569   |   1.559    |  0.421     | 0.674     | -2.408       3.722|
|Area_Santa_Barbara        |-0.0998   |   1.498    | -0.067     | 0.947     |  -3.044       2.845|
|Area_Santa_Clara          | 0.1681   |   0.729    |  0.231     | 0.818     |  -1.265       1.601|
|Area_Santa_Cruz           |-0.3529   |   1.827    | -0.193     | 0.847     |  -3.944       3.238|
|Area_Shasta               |-0.6051   |   2.779    | -0.218     | 0.828     |  -6.068       4.858|
|Area_Solano               |-2.0356   |   2.749    | -0.740     | 0.459     |  -7.440       3.368|
|Area_Sonoma               | 0.4197   |   1.987    |  0.211     | 0.833     |  -3.485       4.325|
|Area_Stanislaus           |-0.9779   |   4.726    | -0.207     | 0.836     | -10.268       8.312|
|Area_Ventura              | 1.7134   |   1.487    |   1.152    |  0.250    |   -1.210       4.636|
|Area_Yolo                 | 2.4941   |   1.719    |  1.451     | 0.148     |  -0.885       5.873|


The coef values represent the average change in Loan_Size for a one-unit change in each predictor, holding other variables constant.

Significant Predictors:

Income (coef = 0.1777, p < 0.001):
- A one-unit increase in income is associated with an increase of 0.1777 in loan size, on average.
- Stakeholder Message: Higher income levels significantly increase loan size, suggesting income is a major determinant of loan allocation.

Family (coef = 1.3289, p < 0.001):
- Loan size increases by 1.33 units for each additional family member.
- Stakeholder Message: Family size positively influences loan size, which may reflect financial responsibilities influencing loan demand.

CCAvg (coef = 1.4333, p < 0.001):
- Average monthly credit card spending significantly increases loan size.
- Stakeholder Message: High credit card spending is a strong indicator of higher loan eligibility or need.

Mortgage (coef = -0.0370, p < 0.001):
- A negative coefficient implies that larger mortgages slightly reduce loan size.
- Stakeholder Message: Customers with higher mortgage liabilities may receive lower loans, possibly reflecting risk concerns.

Non-Significant Predictors:

Age (p = 0.106), Experience (p = 0.254), Online (p = 0.810), many Area variables:
- These variables do not have a statistically significant relationship with loan size as p > 0.05.
- Stakeholder Message: These factors might be excluded in future models unless they align with business insights or strategies.

Categorical Predictors:

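Education:
- The coefficients for Education_Postgrad (2.35), Education_Professional (1.97) and Education_Undergrad (2.13) are all positive and similar in size.
- Caution: Df Model is 40 although 43 predictors are listed, which is consistent with one redundant dummy column for each of Education, Gender and Area. In that case there is no omitted reference category, the individual dummy coefficients are not uniquely identified, and only the differences between categories of the same variable can be interpreted.
- Stakeholder Message: Differences between education levels are small in this model (at most about 0.38 units), so any link between qualifications and loan size should be confirmed after the dummy encoding is corrected.

Gender:
- Gender_Female (3.68) and Gender_Male (2.77) are both in the model, so 3.68 is not the gap between women and men. The difference between the two coefficients suggests that women receive loans that are on average about 0.90 units larger than men's.
- Stakeholder Message: Gender differences in loan sizes could reflect underlying demographic or financial patterns.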

___________________________________________________________________________________________________________________________________________________________

**Diagnostic Measures**
| Measure                  |  Value      |
|--------------------------|-------------|
|Omnibus:                  |  17.650 |
|Durbin-Watson:            |     2.004  |
|Prob(Omnibus):            |    0.000   |
|Jarque-Bera (JB):         |    19.137|
|Skew:                     |   -0.431   |
|Prob(JB):                 |  6.99e-05|
|Kurtosis:                 |    3.482 |
|Cond. No.                 |  7.37e+16|

Omnibus and Jarque-Bera Tests (p < 0.001):
- Indicate that residuals (errors) may not be perfectly normally distributed.
- Stakeholder Message: While the model is strong, residual non-normality could be further investigated to refine the model.

Durbin-Watson Statistic (2.004):
- A value close to 2 suggests no significant autocorrelation in residuals, meaning errors are independent.
- Stakeholder Message: The model meets the independence of errors assumption.

Condition Number (7.37e+16):
- High values suggest multicollinearity issues (predictors are highly correlated).
- Stakeholder Message: Some predictors may overlap in their explanatory power. This could be addressed through techniques like variable selection or regularization (e.g., Ridge/Lasso regression).

Warnings:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.16e-27. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
- This is most likely the dummy variable trap: every category of each one-hot encoded variable was kept, so the dummy columns of a variable add up to the intercept column and the design matrix is perfectly collinear.
- To avoid this
    - call `pd.get_dummies` with the argument `drop_first=True` so that we only create n-1 columns for each variable with n categories
        - (i.e. one variable/column with five categories will be transformed into four columns of 0's and 1's, and the dropped category becomes the reference level)
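
The effect of `drop_first=True` can be seen on a toy table. This is a minimal sketch with made-up values; `toy` and `design_rank` are hypothetical names introduced only for the illustration.

```python
import numpy as np
import pandas as pd

# Toy data: one categorical and one numeric column (values are made up)
toy = pd.DataFrame({'Gender': ['Female', 'Male', 'Female', 'Male', 'Female'],
                    'Income': [40.0, 55.0, 62.0, 38.0, 71.0]})

def design_rank(frame):
    """Return (rank, number of columns) of the design matrix with an intercept."""
    X = np.column_stack([np.ones(len(frame)), frame.astype(float).to_numpy()])
    return np.linalg.matrix_rank(X), X.shape[1]

all_dummies = pd.get_dummies(toy)                 # keeps Gender_Female and Gender_Male
dropped = pd.get_dummies(toy, drop_first=True)    # keeps Gender_Male only

print(design_rank(all_dummies))   # rank is lower than the column count
print(design_rank(dropped))       # rank equals the column count
```

When the rank is lower than the number of columns, at least one coefficient cannot be estimated uniquely, which is what the eigenvalue warning points to.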

_______________________________________________________________________________________________________________________________________________________

**Actionable Insights for Stakeholders**
- Focus efforts on variables like Income, CCAvg, and Family, which are key drivers of loan sizes.
- Investigate non-significant variables for possible removal to simplify the model and enhance interpretability.
- Address potential multicollinearity by refining the input variables.
- Consider segmentation by Education and Gender to tailor loan products effectively.
- Reassess area-specific variables since many are non-significant; geographical targeting may not substantially impact loan size decisions.

In [None]:
df_dummies = pd.get_dummies(df, drop_first=True)

# Again make sure that all the column names have underscores instead of whitespaces
df_dummies.columns = [col.replace(" ", "_") for col in df_dummies.columns]

# Reorder columns with the dependent variable (Loan_Size) the last column
column_titles = [col for col in df_dummies.columns if col !=
                 'Loan_Size'] + ['Loan_Size']
df_dummies = df_dummies.reindex(columns=column_titles)

df_dummies.head()

In [None]:
# We'll keep the model DataFrame, but only specify the columns we want to fit this time
X_names = [col for col in df_dummies.columns if col != y_name]

# Build the OLS formula string " y ~ X "
formula_str = y_name+' ~ '+'+'.join(X_names)

# Fit the model using the model dataframe
model = ols(formula=formula_str, data=dfm)
fitted = model.fit()

# Output the fitted summary
print(fitted.summary())

# Comparison of Feature Selection Methods with Evaluation Metrics
Feature selection methods help improve model accuracy, interpretability, and efficiency.

## Feature Selection Techniques: Categories
Feature selection methods can be broadly classified into three categories:


### **Filter Methods (Statistical Tests)**
These methods evaluate the relevance of each feature independently based on statistical properties. They do not involve any machine learning models.
- Here they are used to make a further selection among the variables based on their statistical significance.

✅ Pros: Fast, scalable, works for high-dimensional data.

❌ Cons: Ignores interactions between features.

Examples:
- Mutual Information: Measures dependency between features and the target variable.
- Correlation Coefficients (Pearson, Spearman): Measures linear or monotonic relationships.
- Chi-Square Test: Used for categorical variables.
- Variance Threshold: Removes features with low variance.
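
To illustrate how two of these filters can disagree, the sketch below scores a synthetic feature that is related to the target non-linearly. The data and the names `x_signal` and `x_noise` are assumptions made for the example; `mutual_info_regression` is scikit-learn's estimator for a continuous target.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x_signal = rng.uniform(-1, 1, size=500)   # drives the target, but not linearly
x_noise = rng.uniform(-1, 1, size=500)    # unrelated to the target
y_demo = x_signal ** 2 + rng.normal(scale=0.05, size=500)

X_demo = np.column_stack([x_signal, x_noise])

# Mutual information picks up the non-linear dependence
mi = mutual_info_regression(X_demo, y_demo, random_state=0)

# Pearson correlation only measures the linear part, which is close to zero here
pearson = [np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(2)]

print("Mutual information:", mi)
print("Pearson correlation:", pearson)
```

A filter based only on Pearson correlation would likely discard `x_signal`, which is why the choice of statistic matters.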

### 1. Variable Selection by Correlation and Significance

We need to choose the best variables to be our predictors. 

One way is to 
- look at the correlations between `Loan_Size` and each variable in our DataFrame
    - and select those with the strongest correlations (both positive and negative).
- consider how statistically significant those correlations are. 

Create a new DataFrame and store the correlation coefficients and p-values in that DataFrame for reference.

In [None]:
# Calculate correlations between predictor variables and the response variable
corrs = df_dummies.corr()['Loan_Size'].sort_values(ascending=False)

In [None]:
from scipy.stats import pearsonr

# Build a dictionary of correlation coefficients and p-values
dict_cp = {}

column_titles = [col for col in corrs.index if col!= 'Loan_Size']
for col in column_titles:
    p_val = round(pearsonr(df_dummies[col], df_dummies['Loan_Size'])[1],6)
    dict_cp[col] = {'Correlation_Coefficient':corrs[col],
                    'P_Value':p_val}
    
df_cp = pd.DataFrame(dict_cp).T
df_cp_sorted = df_cp.sort_values('P_Value')
df_cp_sorted[df_cp_sorted['P_Value']<0.1]

This gives the correlation coefficient and p-value for each feature considered on its own, sorted by p-value.  

If we apply a decision rule with a significance level of 5% (p-value < 0.05), 
- we could infer that the following features are statistically significant:
    - List features

Keep only the variables that have a significant correlation with the dependent variable. 
- Put them into an independent variable DataFrame `X`

In [None]:
# The dependent variable remains the same:
y_data = df_dummies[y_name]  # y_name = 'Loan_Size'

# Model building - Independent Variable (IV) DataFrame
X_names = list(df_cp[df_cp['P_Value'] < 0.05].index)
X_data = df_dummies[X_names]

Also, look for pairs of predictor variables that are highly correlated with each other, to avoid multicollinearity.

It is easier to isolate the sections of the correlation matrix where the off-diagonal correlations are high:

In [None]:
# Create the correlation matrix
corr = X_data.corr()

# Find rows and columns where correlation coefficients > 0.9 or < -0.9
corr[np.abs(corr) > 0.9]

In [None]:
# As before, we create the correlation matrix
# and find rows and columns where correlation coefficients > 0.9 or < -0.9
corr = X_data.corr()
r, c = np.where(np.abs(corr) > 0.9)

# We are only interested in the off diagonal entries:
off_diagonal = np.where(r != c)

# Show the correlation matrix rows and columns where we have highly correlated off diagonal entries:
corr.iloc[r[off_diagonal], c[off_diagonal]]
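
An alternative view lists each highly correlated pair once instead of showing a sub-matrix. The sketch below runs on synthetic data; the frame `demo` and its columns `a`, `b`, `c` are hypothetical, and the 0.9 cut-off simply mirrors the one used above.

```python
import numpy as np
import pandas as pd

# Synthetic predictors: b is almost identical to a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({'a': a,
                     'b': a + rng.normal(scale=0.05, size=200),
                     'c': rng.normal(size=200)})

corr_demo = demo.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
upper = np.triu(np.ones(corr_demo.shape, dtype=bool), k=1)
pairs = corr_demo.where(upper).stack().dropna()

# Pairs whose absolute correlation exceeds the cut-off
high_pairs = pairs[pairs.abs() > 0.9]
print(high_pairs)
```

For each listed pair, one of the two variables would normally be dropped before fitting.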

##### Resulting OLS fit summary

In [None]:
# Let's take a new subset of our potential independent variables
X_remove = ['Age']
X_corr_names = [col for col in X_names if col not in X_remove]

# Create our new OLS formula based upon our smaller subset
formula_str = y_name+' ~ '+' + '.join(X_corr_names)
print('Formula:\n\t{}'.format(formula_str))

In [None]:
# Fit the OLS model using the model dataframe
model=ols(formula=formula_str, data=dfm)
fitted = model.fit()

# Display the fitted summary
print(fitted.summary())

### 2. Variable Selection by Variance Thresholds

Variance Thresholds remove features whose values don't change much from observation to observation. 

The objective here is to remove all features that have a variance lower than the selected threshold.
- Suppose that in our loans dataset 97% of observations were for 40-year-old women, then the *Age* and *Gender* features can be removed without a great loss in information.

It is important to note that variance is dependent on scale, so the features will have to be normalized before implementing variance thresholding.
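
The 97% example can be checked numerically: a 0/1 column in which a share p of the rows equal 1 has variance p(1 - p). The sketch below uses two made-up columns and the same 0.03 threshold that is applied further down; `selector_demo` is a hypothetical name chosen so that it does not clash with the notebook's own variables.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Two made-up binary columns with 100 observations each
mostly_one = np.array([1.0] * 97 + [0.0] * 3)   # 97% of rows share one value
balanced = np.array([1.0] * 50 + [0.0] * 50)    # evenly split
X_demo = np.column_stack([mostly_one, balanced])

# Variance of a 0/1 column with share p of ones is p * (1 - p)
p = 0.97
print("p(1 - p) =", p * (1 - p))

selector_demo = VarianceThreshold(threshold=0.03).fit(X_demo)
print("Variances:", selector_demo.variances_)
print("Kept:", selector_demo.get_support())
```

Binary columns are already on a 0 to 1 scale, which is why Min-Max scaling the remaining features makes their variances comparable with these.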

In [None]:
# Separate data into independent (X) and dependent (y) variables
X_names = list(df_dummies.columns)
X_names.remove(y_name)
X_data = df_dummies[X_names]
y_data = df_dummies[y_name]

In [None]:
# Normalize data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_data)
X_normalize = pd.DataFrame(X_scaled, columns=X_data.columns)

#### Variance Threshold in Scikit Learn

To implement Variance Threshold in Scikit Learn we have to do the following:

- Import and create an instance of the VarianceThreshold class;
- Use the .fit() method to select a subset of features based on the threshold.

In [None]:
from sklearn.feature_selection import VarianceThreshold

# Create VarianceThreshold object
selector = VarianceThreshold(threshold=0.03)

# Use the object to apply the threshold on data
selector.fit(X_normalize)

##### Calculated variance for each predictive variable.

Show the variances of the individual columns before any threshold is applied. 

It allows us to revise our initial variance threshold if we feel that we might exclude important variables.

In [None]:
# Get column variances
column_variances = selector.variances_

# Pair each column name with its variance (one record per column)
vars_dict = [{"Variable_Name": c_name, "Variance": c_var}
             for c_name, c_var in zip(X_normalize.columns, column_variances)]
df_vars = pd.DataFrame(vars_dict)
df_vars.sort_values(by='Variance', ascending=False)

#### Extract the results and use them to select our new columns

In [None]:
# Select new columns
X_new = X_normalize[X_normalize.columns[selector.get_support(indices=True)]]

# Save variable names for later
X_var_names = X_new.columns

# View first few entries
X_new.head()

In [None]:
# Create Variance Threshold objects
selector_1 = VarianceThreshold(threshold=0.05)
selector_2 = VarianceThreshold(threshold=0.1)
selector_3 = VarianceThreshold(threshold=0.15)

In [None]:
selector_1.fit(X_normalize)

In [None]:
selector_2.fit(X_normalize)

In [None]:
selector_3.fit(X_normalize)

In [None]:
# Select subset of columns
X_1 = X_normalize[X_normalize.columns[selector_1.get_support(indices=True)]]
X_2 = X_normalize[X_normalize.columns[selector_2.get_support(indices=True)]]
X_3 = X_normalize[X_normalize.columns[selector_3.get_support(indices=True)]]

In [None]:
# Create figure and axes
f, ax = plt.subplots(figsize=(8, 3), nrows=1, ncols=1)

# Count the predictors kept at each threshold and label the bars
subset_preds = [X_1.shape[1], X_2.shape[1], X_3.shape[1]]
thresholds = ['0.05', '0.1', '0.15']

# Plot graph
ax.set_title('# of Predictors vs Thresholds')
ax.set_ylabel('# of Predictors')
ax.set_xlabel('Threshold')
sns.barplot(x=thresholds, y=subset_preds)
plt.show()


##### Extract the predictor names of the 3 different datasets above?

Resulting OLS fit summary for a threshold of 0.03:

In [None]:
# What is our new OLS formula?
formula_str = y_name+' ~ '+' + '.join(X_new.columns)
print('Formula:\n\t{}'.format(formula_str))

In [None]:
# Fit the model using the model dataframe
model = ols(formula=formula_str, data=df_dummies)
fitted = model.fit()

print(fitted.summary())

#### Advantages & Disadvantages of Variance Thresholds

Let's consider some trade-offs associated with using variance thresholds for variable selection: 

**Advantages**

* Applying variance thresholds is based on solid intuition: features that don't change much also don't add much information;
* Easy and relatively safe way to reduce dimensionality (i.e. number of features) at the start of the modeling process.

**Disadvantages**

* Not the ideal algorithm if dimensionality reduction is not really required;
* The threshold must be manually tuned, which can be a fickle process requiring domain/problem expertise.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Preprocess the data

Split the data once, so that all models are trained and tested on the same observations.

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_data,
                                                    y_data,
                                                    test_size=0.20,
                                                    shuffle=False)

In [None]:
# Get training and testing data for variance threshold model
X_var_train = X_train[X_var_names]
X_var_test = X_test[X_var_names]

In [None]:
# Get training and testing data for correlation threshold model
X_corr_train = X_train[X_corr_names]
X_corr_test = X_test[X_corr_names]

##### Fit models

Instantiate and fit our models.

In [None]:
lm = LinearRegression()
lm_corr = LinearRegression()
lm_var = LinearRegression()

In [None]:
lm.fit(X_train, y_train);
lm_corr.fit(X_corr_train,y_train);
lm_var.fit(X_var_train,y_train);

##### Assess model accuracy 
Let's see how our linear models performed!

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Create figure and axes
f, ax = plt.subplots(figsize=(15, 5), nrows=1, ncols=3, sharey=True)

# Create list of titles and predictions to use in for loop
train_pred = [lm.predict(X_train),
              lm_corr.predict(X_corr_train),
              lm_var.predict(X_var_train)]
test_pred = [lm.predict(X_test),
             lm_corr.predict(X_corr_test),
             lm_var.predict(X_var_test)]
title = ['No threshold', 'Corr threshold', 'Var threshold']

# Key:
# No threshold - linear regression with all predictive variables
# Corr threshold - linear regression with correlation thresholded predictive variables
# Var threshold - linear regression with variance thresholded predictive variables


# Loop through all axes to plot each model's results
for i in range(3):
    # scikit-learn metrics expect (y_true, y_pred); the order matters for r2_score
    test_mse = round(mean_squared_error(y_test, test_pred[i]), 4)
    test_r2 = round(r2_score(y_test, test_pred[i]), 4)
    train_mse = round(mean_squared_error(y_train, train_pred[i]), 4)
    train_r2 = round(r2_score(y_train, train_pred[i]), 4)
    title_str = f"Linear Regression({title[i]}) \n train MSE = {train_mse} \n " + \
                f"test MSE = {test_mse} \n training $R^{2}$ = {train_r2} \n " + \
                f"test $R^{2}$ = {test_r2}"
    ax[i].set_title(title_str)
    ax[i].set_xlabel('Actual')
    ax[i].set_ylabel('Predicted')
    ax[i].plot(y_test, y_test, 'r')
    ax[i].scatter(y_test, test_pred[i])
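
A note on argument order: scikit-learn metrics take the true values first and the predictions second. Mean squared error is symmetric, but $R^2$ is not, because it is scaled by the variance of whatever is passed as the true values. The toy numbers below are made up purely to show the difference.

```python
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions
y_true_demo = [1, 2, 3, 4]
y_pred_demo = [2, 2, 3, 3]

# MSE is the same whichever way round the arguments are given
mse_correct = mean_squared_error(y_true_demo, y_pred_demo)
mse_swapped = mean_squared_error(y_pred_demo, y_true_demo)

# R^2 changes, because the total sum of squares comes from the first argument
r2_correct = r2_score(y_true_demo, y_pred_demo)
r2_swapped = r2_score(y_pred_demo, y_true_demo)

print(mse_correct, mse_swapped)
print(r2_correct, r2_swapped)
```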

### **Wrapper Methods (Model-Based Selection)**
These methods train models and evaluate feature subsets based on performance metrics.

✅ Pros: Accounts for feature interactions, and usually finds a better-performing subset than filter methods (only an exhaustive search guarantees the best one).

❌ Cons: Computationally expensive.

Examples:
- Best Subset Selection: Evaluates all feature combinations (very slow).
- Recursive Feature Elimination (RFE): Removes least important features iteratively.
- Forward/Backward Selection: Adds/removes features one at a time.
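
As a minimal sketch of a wrapper method, the example below runs Recursive Feature Elimination with a linear model. The data is synthetic (only the first three of six columns drive the target), and the choice of three features to keep is an assumption made for the illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: columns 0-2 are informative, columns 3-5 are pure noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = (3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 1.5 * X_demo[:, 2]
          + rng.normal(scale=0.5, size=200))

# Refit the model repeatedly, dropping the weakest feature each round (step=1)
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1)
rfe.fit(X_demo, y_demo)

print("Selected mask:", rfe.support_)
print("Ranking (1 = kept):", rfe.ranking_)
```

With a linear model, RFE ranks features by coefficient size, so predictors should be on comparable scales before it is applied.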

### 1. Best Subset Selection in Machine Learning
Definition:

Best subset selection is a feature selection technique used in regression and machine learning to identify the optimal subset of predictors (independent variables) that best explain the dependent variable. This method evaluates all possible combinations of features and selects the one that is best according to a chosen criterion, i.e. the lowest value of an error-type metric or the highest adjusted $R^2$:
- AIC, 
- BIC, 
- adjusted $R^2$, 
- cross-validation error. 

It is practical only when the number of predictors is small to moderate, because the number of candidate models doubles with every additional predictor.

### Steps in Best Subset Selection
Prepare the Dataset:
- Ensure the dataset is clean, with no missing values or irrelevant variables.
- Identify the dependent variable and all possible independent variables.

Generate All Possible Subsets of Predictors:
- Consider all combinations of predictors, ranging from using no predictors to using all predictors.
    - If there are p predictors, there will be $2^p$ possible subsets.
- **NB - Best Subset Selection exhaustively evaluates all feature subsets.**
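
The growth in the number of subsets can be made concrete with a few lines of code. This is a small sketch; the three feature names are borrowed from the loans example and `n_subsets` is a hypothetical helper.

```python
from itertools import combinations
from math import comb

def n_subsets(p):
    """Count all subsets of p predictors, from the empty set to the full set."""
    return sum(comb(p, k) for k in range(p + 1))

# The count doubles with every extra predictor
for p in (5, 13, 43):
    print(p, "predictors ->", n_subsets(p), "candidate models")

# Enumerating the subsets explicitly for three predictors
features = ['Income', 'Family', 'CCAvg']
all_subsets = [subset
               for k in range(len(features) + 1)
               for subset in combinations(features, k)]
print(all_subsets)
```

Fitting one model per subset is feasible for a handful of predictors but not for the 43 dummy-encoded columns in the loans model.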

Train a Model for Each Subset:
- Fit a regression model (e.g., linear regression) for each subset of predictors.
- Compute performance metrics for each model.

Evaluate Models Using a Selection Criterion:
- Compare models using criteria like:
    - Adjusted $R^2$ – Rewards models that explain more variance while penalizing unnecessary predictors.
        - **NB - It ensures we balance performance with model complexity.**
    - Akaike Information Criterion (AIC) – Balances model fit and complexity.
    - Bayesian Information Criterion (BIC) – Similar to AIC but penalizes complexity more.
    - Cross-validation error (e.g., mean squared error on validation data).
        - **NB - Cross-validation helps detect overfitting because every model is scored on data it was not trained on.**

Select the Best Model:
- Choose the model with the best evaluation metric, e.g:
    - highest adjusted $R^2$
    - or lowest AIC/BIC.

Validate the Model:
- Test the selected model on unseen data (e.g., using cross-validation) to check its generalization ability.

**Two common approaches to selecting the best model concerning test error**
1. Cross-Validation (CV) Approach

✅ What It Does:
- Splits the data into training and validation sets multiple times.
- The model is trained on different subsets and tested on the remaining data.
- The average test error across all validation sets is used to select the best model.

✅ Why It Works:
- Helps estimate out-of-sample performance.
- Reduces the risk of overfitting by evaluating the model on unseen data.
- Common types:
    - k-Fold Cross-Validation (splits data into k parts, trains on k−1, tests on the remaining part).
    - Leave-One-Out Cross-Validation (LOOCV) (uses one observation for testing, rest for training).

2. Information Criteria Approach (AIC/BIC)

✅ What It Does:
- Uses statistical metrics to balance model fit and complexity.
- Penalizes models with too many predictors to avoid overfitting.
- Two commonly used criteria:
    - Akaike Information Criterion (AIC): Focuses on minimizing information loss.
    - Bayesian Information Criterion (BIC): Stronger penalty for model complexity than AIC.

✅ Why It Works:
- Selects a model with the best trade-off between accuracy and simplicity.
- Lower AIC/BIC values indicate a better model.

| Method | Best for   | Drawback                     |
|---------------|--------|----------------------------------| 
|Cross-Validation (CV)|General model evaluation & prediction performance |	Computationally expensive for large datasets|
|AIC/BIC	|Selecting best statistical model with trade-off between fit & complexity | Assumes correct model form and doesn’t directly measure test error |
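
The cross-validation approach described above can be sketched as follows. The data is synthetic (the target depends on the first two of four columns), and the candidate subsets and the choice of 5 folds are assumptions for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: only columns 0 and 1 influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Use the same folds for every candidate so the comparison is fair
cv = KFold(n_splits=5, shuffle=True, random_state=0)
candidates = {'x0 only': [0], 'x0 + x1': [0, 1], 'all four': [0, 1, 2, 3]}

cv_mse = {}
for name, cols in candidates.items():
    # scikit-learn reports negated MSE so that larger is always better
    scores = cross_val_score(LinearRegression(), X_demo[:, cols], y_demo,
                             cv=cv, scoring='neg_mean_squared_error')
    cv_mse[name] = -scores.mean()

best = min(cv_mse, key=cv_mse.get)
print(cv_mse)
print("Lowest CV error:", best)
```

Leaving out a relevant predictor raises the cross-validated error noticeably, while adding irrelevant ones changes it only slightly.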

##### **Model Selection Criteria in Subset Selection**
When performing subset selection in regression, the goal is to find the combination of predictors that gives the most accurate and interpretable model. The key challenge is balancing model fit and complexity:
- a model with too many predictors might overfit, while 
- a model with too few predictors might underfit.

To compare candidate models and select the best one we need evaluation metrics. The commonly used criteria are:
- Mallows' $C_p$
- Akaike Information Criterion (AIC)
- Bayesian Information Criterion (BIC)
- Adjusted $R^2$

Each metric balances model fit and complexity to prevent overfitting or underfitting.

_____________________

**Mallows' $C_p$ Criterion**
- It estimates how well a subset model predicts new data while considering bias and variance.
- It helps identify models with low prediction error.
- It balances model fit (RSS) and complexity (number of parameters d).

Equation for $C_p$
$$ C_p = \frac{1}{\hat{\sigma}^2} (RSS + 2d\hat{\sigma}^2) - n $$ 

or, equivalently:
$$ C_p = \frac{RSS}{\hat{\sigma}^2} + 2d - n $$

- where:
    - RSS = Residual Sum of Squares of the model
    - $\hat{\sigma}^2$ = estimate of the error variance, usually taken from the full model containing all candidate predictors
    - d = number of parameters in the model (predictors plus intercept)
    - n = number of observations
- Some texts use the rescaled version $\frac{1}{n}(RSS + 2d\hat{\sigma}^2)$, which ranks models identically.

Interpretation
- Lower $C_p$ means a better model.
- A model with $C_p$ ≈ d suggests little bias and good predictive performance.
    - If $C_p$ is much larger than d, the model is probably biased because important predictors are missing (underfitting).
    - A $C_p$ slightly below d can occur through sampling variation and is not a problem in itself. The full model always has $C_p$ = d when $\hat{\sigma}^2$ is estimated from it.

When to Use?
- When selecting among different regression models.
- Useful when comparing models with different numbers of predictors.

Decision Rule:
- Among the models with $C_p$ close to or below d, select the one with the lowest $C_p$.
- If two models have similar $C_p$ choose the simpler one (fewer predictors).
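
A short derivation shows why the variance estimate should not come from the subset model being scored. Assuming $\hat{\sigma}^2$ were computed from that same model as $RSS/(n - d)$, the practical formula collapses to d for every candidate:

$$ \hat{\sigma}^2 = \frac{RSS}{n - d} \;\Rightarrow\; C_p = \frac{RSS}{RSS/(n - d)} + 2d - n = (n - d) + 2d - n = d $$

The criterion would then be unable to tell models apart, which is why a single estimate from the full model is normally used for all subsets.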
___________________________________

**Akaike Information Criterion (AIC)**
- AIC measures how well a model explains the data, based on its likelihood (the probability of the data given the model), while penalizing complexity.
- It is widely used in linear and logistic regression.

Equation for AIC
$$ AIC = -2\log L + 2d $$

- where:
    - L = maximized likelihood of the model
    - d = number of estimated parameters (predictors plus intercept)

For linear regression with n observations (up to an additive constant that is the same for every model fitted to the same data):
$$ AIC = n \log\left(\frac{RSS}{n}\right) + 2d $$

Interpretation
- Lower AIC is better.
- Balances model fit (log-likelihood) and complexity (penalty term).
- AIC prefers models that generalize well (not too complex, not too simple).
- Unlike $C_p$, it can be used for any model fitted by maximum likelihood (linear, logistic, time series, etc.).

When to Use?
- When comparing models in regression, time series, and machine learning.
- Works well when the true model is unknown.

Decision Rule:
- Select the model with the lowest AIC.
- If two models have similar AIC, prefer the simpler model.

_______________________________

**Bayesian Information Criterion (BIC)**
- BIC is similar to AIC but imposes a stronger penalty on complex models.
- Used when trying to find the best explanatory model (not just predictive performance).
- It is based on Bayesian probability.

Equation for BIC
$$ BIC = -2\log L + d \log n $$

For linear regression:
$$ BIC = n \log\left(\frac{RSS}{n}\right) + d \log n $$
- where:
    - the penalty $d \log n$ is heavier than AIC's $2d$ whenever $\log n > 2$, i.e. for n ≥ 8 observations.

Interpretation
- Lower BIC is better.
- Stronger penalty for more parameters compared to AIC.
- More likely to select simpler models.
- If AIC and BIC select different models, BIC tends to be more conservative.

When to Use BIC?
- When selecting a model for inference (finding the best explanatory variables).
- Works well when sample size n is large.

Decision Rule:
- Select the model with the lowest BIC.
- If models have close BIC values, pick the simpler one.


AIC vs. BIC

| Criterion | Focus   | Penalty for Complexity          | When to Use?  |
|---------------|--------|----------------------------------| -----------------------|  
|AIC	|Model Fit & Complexity |	Mild (2d) | If the goal is prediction     |
|BIC	|Parsimony (Simplicity) | Stronger (d log n)    |If the goal is finding the true model  |
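
Both criteria can be checked against the OLS summary shown earlier for the loans model. The sketch below assumes the likelihood-based formulas above with d = 41 parameters (Df Model of 40 plus the intercept); the results should agree with the reported AIC of 2856 and BIC of 3026 to within rounding.

```python
import math

# Values read from the OLS regression summary
log_likelihood = -1387.0   # Log-Likelihood
n = 471                    # No. Observations
d = 41                     # Df Model (40) + intercept

# AIC = -2 log L + 2d,  BIC = -2 log L + d log n
aic = -2 * log_likelihood + 2 * d
bic = -2 * log_likelihood + d * math.log(n)

print(f"AIC = {aic:.1f}, BIC = {bic:.1f}")
print(f"AIC penalty = {2 * d}, BIC penalty = {d * math.log(n):.1f}")
```

The fit term is identical in both, so the gap between BIC and AIC comes entirely from the heavier complexity penalty.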

_________________

**Adjusted $R^2$**
- Adjusted $R^2$ improves on $R^2$ by accounting for the number of predictors in the model.

Equation for Adjusted $R^2$
$$ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - d - 1} $$

- where:
    - $R^2$ = coefficient of determination
    - n = number of observations
    - d = number of predictors (excluding intercept)

Interpretation
- Higher $R^2_{adj}$ is better.
- Unlike $R^2$,  it does not always increase when adding predictors.
    - If a predictor does not add value, $R^2_{adj}$ will decrease.
    - If adding a predictor improves the model $R^2_{adj}$ increases.

When to Use?
- When comparing models with different numbers of predictors.
- If the goal is to maximize explained variance while controlling for complexity.

Decision Rule:
- Select the model with the highest Adjusted $R^2$
- Do not just maximize $R^2$; prefer simpler models with high $R^2_{adj}$.

In [None]:
# Step 1: Import Libraries & Load Data
import numpy as np
import pandas as pd
import statsmodels.api as sm
from itertools import combinations

# Load an example dataset.
# NOTE: load_boston was removed in scikit-learn 1.2, so the bundled diabetes
# dataset (10 predictors) is used here as a stand-in example.
from sklearn.datasets import load_diabetes
diabetes = load_diabetes(as_frame=True)
df = diabetes.frame  # predictors plus the 'target' column

# Define response variable (Y) and predictor variables (X)
Y = df['target']
X = df.drop(columns=['target'])

# Step 2: Compute Subset Selection Metrics
# Estimate the error variance once, from the full model with all predictors
full_model = sm.OLS(Y, sm.add_constant(X)).fit()
sigma_hat2 = full_model.mse_resid  # RSS_full / (n - d_full)

# Function to compute AIC, BIC, Cp, Adjusted R^2 for a given subset
def compute_metrics(X_subset, Y, sigma_hat2):
    X_design = sm.add_constant(X_subset)  # every candidate keeps the intercept
    model = sm.OLS(Y, X_design).fit()
    n, d = X_design.shape                 # d = parameters, including intercept
    
    # Compute metrics
    RSS = np.sum(model.resid ** 2)
    AIC = n * np.log(RSS / n) + 2 * d
    BIC = n * np.log(RSS / n) + d * np.log(n)
    Cp = (RSS / sigma_hat2) + 2 * d - n
    Adj_R2 = model.rsquared_adj           # uses predictors excluding intercept
    
    return d, AIC, BIC, Cp, Adj_R2

# Compute for all non-empty subsets of predictors
results = []
for k in range(1, X.shape[1] + 1):
    for subset in combinations(X.columns, k):
        d, AIC, BIC, Cp, Adj_R2 = compute_metrics(X[list(subset)], Y, sigma_hat2)
        results.append((subset, d, AIC, BIC, Cp, Adj_R2))

# Convert results to DataFrame
results_df = pd.DataFrame(results, columns=['Predictors', 'd', 'AIC', 'BIC', 'Cp', 'Adj_R2'])

# Step 3: Select Best Model
# Select best model based on lowest AIC
best_aic_model = results_df.loc[results_df['AIC'].idxmin()]
print("Best Model (AIC):", best_aic_model)

# Select best model based on lowest BIC
best_bic_model = results_df.loc[results_df['BIC'].idxmin()]
print("Best Model (BIC):", best_bic_model)

# Select best model based on lowest Cp (compare it with d in the output;
# the full model always has Cp = d, so |Cp - d| alone cannot be the rule)
best_cp_model = results_df.loc[results_df['Cp'].idxmin()]
print("Best Model (Cp):", best_cp_model)

# Select best model based on highest Adjusted R2
best_adj_r2_model = results_df.loc[results_df['Adj_R2'].idxmax()]
print("Best Model (Adjusted R2):", best_adj_r2_model)


| Criterion | Purpose | Penalty for Complexity | Decision Rule | Best When |
|-----------|---------|------------------------|---------------|-----------|
| AIC | Model fit & complexity | Moderate ($2d$) | Choose the model with the lowest AIC | The goal is prediction |
| BIC | Parsimony (simplicity) | Stronger ($d \log n$) | Choose the model with the lowest BIC (more conservative) | The goal is finding the true model |
| $C_p$ | Predictive accuracy | Moderate ($2d$) | Choose a model with low $C_p$, with $C_p \approx d$ | The full model gives a reliable estimate of $\hat{\sigma}^2$ |
| $R^2_{adj}$ | Maximize variance explained | Implicit (through $n - d - 1$) | Choose the model with the highest $R^2_{adj}$ | Comparing models with different numbers of predictors |

### **Analysing RSS and $R^2$**
In multiple linear regression, the number of predictor variables plays a crucial role in how well the model fits the data. Two common metrics for evaluating this fit are the Residual Sum of Squares (RSS) and the Coefficient of Determination ($R^2$).

##### **Relationship between RSS and Number of Variables**

$$ RSS = \sum^n_{i = 1} (y_i - \hat{y_i})^2 $$
- How RSS Changes as the Number of Predictors Increases
    - Adding More Predictors Never Increases RSS
        - Each new variable introduces additional flexibility, allowing the model to better fit the data.
        - Even if the new variable is not actually useful, RSS still decreases (or, at worst, remains unchanged).
        - The reason is that least-squares estimation chooses the coefficients that minimize RSS. Setting the new coefficient to zero reproduces the smaller model, so the larger model can always fit at least as well.
    - Downside: Overfitting
        - A model that includes too many variables may capture random noise rather than true relationships.
        - This results in a model that performs well on training data but poorly on new, unseen data (low generalizability).

Key Insight
- Adding more predictors never increases RSS, but a lower RSS does not necessarily mean better predictive power.

##### **Relationship between $R^2$ and Number of Variables**

$$ 𝑅^2 = 1 - \frac{RSS}{TSS} $$
- Where:
    - $𝑅^2$ represents the proportion of variance explained by the model.
    - TSS (Total Sum of Squares) measures total variance in the outcome variable.

How $𝑅^2$ Changes as the Number of Predictors Increases
- $R^2$ Always Increases or Remains the Same
    - Since RSS never increases when adding variables, $𝑅^2$ either increases or stays the same.
    - It never decreases, even if the additional predictor is completely irrelevant.
- Downside: $𝑅^2$ Overestimates Model Quality
    - A higher $𝑅^2$ does not necessarily mean the model is better.
    - If a predictor has no real relationship with Y, it still reduces RSS slightly, artificially increasing $𝑅^2$
    - This makes $𝑅^2$ a misleading metric for model comparison when different numbers of predictors are used.

Key Insight
- Adding more predictors always increases $R^2$ (or keeps it unchanged), even if the new variables have no real predictive power.
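
The two monotonicity claims above can be checked numerically. The sketch below is a minimal illustration using only NumPy: it fits OLS by least squares, then appends pure-noise columns one at a time and records the training RSS and $R^2$. The helper `rss_and_r2`, the seed, and the simulated data are assumptions made for this example.

```python
import numpy as np

def rss_and_r2(X, y):
    """Fit OLS by least squares (X includes the intercept column) and
    return the training RSS and R^2."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    return rss, 1 - rss / tss

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
y = 3 + 2 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1])   # intercept + one real predictor
history = [rss_and_r2(X, y)]
for _ in range(5):                       # append five pure-noise columns
    X = np.column_stack([X, rng.normal(size=n)])
    history.append(rss_and_r2(X, y))

for k, (rss, r2) in enumerate(history):
    print(f"noise columns={k}  RSS={rss:.3f}  R2={r2:.4f}")
```

Each added column carries no information about `y`, yet the printed RSS never goes up and the printed $R^2$ never goes down.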

##### **Addressing the Issue: Adjusted $𝑅^2$**
- To correct for the tendency of $R^2$ to overstate model quality, we use Adjusted $R^2$:
$$ R^2_{adj} = 1 - \frac{RSS / (n - d - 1)}{TSS / (n - 1)} $$
- Where:
    - n = number of observations
    - d = number of predictors

How Adjusted $R^2$ Helps
- Introduces a penalty for adding more predictors.
- Increases only if the new variable reduces RSS by more than the loss of one degree of freedom costs; for a single added predictor, this happens exactly when its t-statistic satisfies $|t| > 1$.
- Decreases if an added predictor contributes too little, which flags unnecessary complexity.

Key Insight
- Unlike $R^2$, Adjusted $R^2$ does not always increase with more variables; it penalizes unnecessary complexity.
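
A standard property of OLS makes the penalty concrete: adding a single predictor raises Adjusted $R^2$ exactly when the t-statistic of that predictor exceeds 1 in absolute value. The sketch below checks this on simulated data using only NumPy; the helper `adj_r2_and_t`, the seed, and the data-generating model are assumptions for illustration.

```python
import numpy as np

def adj_r2_and_t(X, y):
    """OLS with X including the intercept column.
    Returns adjusted R^2 and the t-statistics of all coefficients."""
    n, k = X.shape                       # k = d + 1 coefficients
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    tss = float(((y - y.mean()) ** 2).sum())
    sigma2 = rss / (n - k)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    adj_r2 = 1 - (rss / (n - k)) / (tss / (n - 1))
    return adj_r2, beta / se

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
y = 3 + 2 * x1 + rng.normal(size=n)
X_base = np.column_stack([np.ones(n), x1])
adj_base, _ = adj_r2_and_t(X_base, y)

checks = []
for trial in range(5):
    noise = rng.normal(size=n)           # irrelevant predictor
    adj_new, t = adj_r2_and_t(np.column_stack([X_base, noise]), y)
    went_up = adj_new > adj_base
    checks.append((abs(t[-1]) > 1) == went_up)
    print(f"trial {trial}: t={t[-1]:+.2f}  adjusted R2 went up: {went_up}")
```

Because a pure-noise predictor occasionally reaches $|t| > 1$ by chance, Adjusted $R^2$ can still rise slightly for an irrelevant variable; it is a softer filter than BIC or cross-validation.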

The following implementation illustrates how the number of predictors interacts with RSS, $R^2$, and Adjusted $R^2$ in multiple linear regression.
We will:

- Generate synthetic data with a known structure.
- Fit multiple regression models by adding predictors step by step.
- Track RSS, $𝑅^2$, and Adjusted $𝑅^2$ to observe their behavior.
- Visualize the interaction between these metrics and the number of predictors.

In [None]:
# Step 1: Import Required Libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Step 2: Generate Synthetic Data
# We'll create a dataset where only two predictors 𝑋1 and 𝑋2  are truly related to the target 𝑌 but we'll introduce irrelevant predictors 
# 𝑋3,𝑋4,…,𝑋8 to observe their effect.
# Set seed for reproducibility
np.random.seed(42)

# Generate 100 observations
n = 100  

# True predictors
X1 = np.random.normal(0, 1, n)
X2 = np.random.normal(0, 1, n)

# Noise
epsilon = np.random.normal(0, 1, n)

# True relationship
Y = 3 + 2*X1 - 1.5*X2 + epsilon

# Irrelevant predictors
X3 = np.random.normal(0, 1, n)
X4 = np.random.normal(0, 1, n)
X5 = np.random.normal(0, 1, n)
X6 = np.random.normal(0, 1, n)
X7 = np.random.normal(0, 1, n)
X8 = np.random.normal(0, 1, n)

# Combine into a DataFrame
df = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'X4': X4, 'X5': X5, 'X6': X6, 'X7': X7, 'X8': X8, 'Y': Y})

# Display first 5 rows
df.head()

# Step 3: Define a Function to Compute Regression Metrics
# This function will: 
# 1. Fit a linear regression model with a given set of predictors.
# 2. Compute RSS, R^2, and Adjusted R^2.
# 3. Return these metrics for analysis.

def compute_metrics(predictors):
    X = df[predictors]  # Select given predictors
    X = sm.add_constant(X)  # Add intercept term
    y = df['Y']
    
    model = sm.OLS(y, X).fit()  # Fit OLS regression
    
    rss = np.sum(model.resid ** 2)  # Compute RSS
    r2 = model.rsquared  # Compute R^2
    adj_r2 = model.rsquared_adj  # Compute Adjusted R^2
    
    return rss, r2, adj_r2

# Step 4: Evaluate Different Models
# We'll start with no predictors, then incrementally add variables to observe how RSS, 𝑅2, and Adjusted 𝑅2 change.
predictor_sets = [
    [],                    # No predictors
    ['X1'],                # One true predictor
    ['X1', 'X2'],          # Both true predictors
    ['X1', 'X2', 'X3'],    # Adding an irrelevant predictor
    ['X1', 'X2', 'X3', 'X4'],
    ['X1', 'X2', 'X3', 'X4', 'X5'],
    ['X1', 'X2', 'X3', 'X4', 'X5', 'X6'],
    ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7'],
    ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8'],  # All predictors
]

results = []

# Compute metrics for each model
for predictors in predictor_sets:
    rss, r2, adj_r2 = compute_metrics(predictors)
    results.append((len(predictors), rss, r2, adj_r2))

# Convert results to DataFrame for visualization
results_df = pd.DataFrame(results, columns=['Num_Predictors', 'RSS', 'R2', 'Adjusted_R2'])

# Display results
results_df

# Step 5: Visualizing the Interaction
# Now, let's plot RSS, 𝑅2, and Adjusted 𝑅2 as functions of the number of predictors.
plt.figure(figsize=(12, 5))

# Plot RSS
plt.subplot(1, 3, 1)
plt.plot(results_df['Num_Predictors'], results_df['RSS'], marker='o', color='red', label='RSS')
plt.xlabel('Number of Predictors')
plt.ylabel('RSS')
plt.title('RSS vs Number of Predictors')
plt.grid(True)

# Plot R^2
plt.subplot(1, 3, 2)
plt.plot(results_df['Num_Predictors'], results_df['R2'], marker='o', color='blue', label='R^2')
plt.xlabel('Number of Predictors')
plt.ylabel('R^2')
plt.title('R^2 vs Number of Predictors')
plt.grid(True)

# Plot Adjusted R^2
plt.subplot(1, 3, 3)
plt.plot(results_df['Num_Predictors'], results_df['Adjusted_R2'], marker='o', color='green', label='Adjusted R^2')
plt.xlabel('Number of Predictors')
plt.ylabel('Adjusted R^2')
plt.title('Adjusted R^2 vs Number of Predictors')
plt.grid(True)

plt.tight_layout()
plt.show()

Interpretation of Results

- RSS Never Increases
    - As more predictors are added, RSS keeps declining (or stays flat).
    - However, this does not mean the model improves; it could be overfitting.
- $R^2$ Always Increases or Remains Constant
    - Since RSS never increases, $R^2$ keeps rising, even when adding irrelevant predictors.
- Adjusted $R^2$ Initially Increases, Then Tends to Decrease
    - When adding useful predictors (e.g., $X_1, X_2$), Adjusted $R^2$ increases.
    - When adding irrelevant predictors (e.g., $X_3, X_4, \dots$), Adjusted $R^2$ typically levels off or decreases, although a noise variable can raise it slightly by chance.
        - A decline signals that we are adding unnecessary complexity.

Key Takeaways

✅ RSS is unreliable for selecting the best model because it never increases as predictors are added.

✅ $R^2$ is misleading because it never decreases with more predictors, even if they don't improve prediction.

✅ Adjusted $R^2$ is more useful because it penalizes complexity and tends to peak near the optimal number of predictors.

NB: add cross-validation to confirm these results on test data

##### **Effects of Low RSS (or High $𝑅^2$) on Training and Test Error**
- Residual Sum of Squares (RSS) and $𝑅^2$ are key metrics in regression analysis, but their impact on training and test error depends on the complexity of the model.

Training Error: Low RSS (High $𝑅^2$) Always Improves
- Lower RSS means that the model fits the training data more closely, leading to lower training error.
- Higher $𝑅^2$ (closer to 1) suggests the model explains more of the variance in the training data.
- Adding more predictors never increases RSS and never decreases $R^2$ (even if the predictors are irrelevant).

✅ Effect:
- Training error keeps decreasing as we add more predictors.
- This can lead to overfitting, where the model memorizes training data instead of learning general patterns.

Test Error: Low RSS (High $𝑅^2$ ) Can Be Misleading
- If the model is too complex, it captures noise in the training data, which leads to poor generalization.
- As a result, the test error may increase even though the training error remains low.
- The bias-variance tradeoff explains this effect:
    - Underfitting (high bias): Model is too simple → both training and test error are high.
    - Good fit (optimal complexity): Model balances fit and generalization → test error is minimized.
    - Overfitting (high variance): Model is too complex → training error is low, but test error increases.

✅ Effect:
- A very low training RSS (or very high training $R^2$) reached by adding many predictors can be a sign of overfitting, not of a better model.
- The model may perform poorly on new, unseen data (high test error).

| Scenario | Training Error | Test Error | Model Complexity |
|----------|----------------|------------|------------------|
| Underfitting (High Bias) | High | High | Too Simple |
| Good Fit (Balanced) | Moderate | Low (Optimal) | Just Right |
| Overfitting (High Variance) | Very Low | High | Too Complex |

🔹 Low RSS / High $𝑅^2$ in training is not always good! If the test error increases, the model is overfitting. The best model minimizes test error, not just training error.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# 1. Generate synthetic data
np.random.seed(42)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 2 * X**2 + np.random.normal(0, 2, size=X.shape)  # Quadratic relationship + noise

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Train models with different polynomial degrees
degrees = [1, 2, 5, 10]  # Increasing model complexity
train_errors, test_errors = [], []

plt.figure(figsize=(12, 6))
for d in degrees:
    poly = PolynomialFeatures(degree=d)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)
    
    model = LinearRegression()
    model.fit(X_train_poly, y_train)
    
    y_train_pred = model.predict(X_train_poly)
    y_test_pred = model.predict(X_test_poly)
    
    train_mse = mean_squared_error(y_train, y_train_pred)
    test_mse = mean_squared_error(y_test, y_test_pred)
    
    train_errors.append(train_mse)
    test_errors.append(test_mse)
    
    # Plot the fitted model
    X_range = np.linspace(-3, 3, 100).reshape(-1, 1)
    X_range_poly = poly.transform(X_range)
    y_range_pred = model.predict(X_range_poly)
    
    plt.subplot(2, 2, degrees.index(d) + 1)
    plt.scatter(X_train, y_train, color='blue', label='Train Data')
    plt.scatter(X_test, y_test, color='red', label='Test Data')
    plt.plot(X_range, y_range_pred, color='green', label=f'Degree {d} Fit')
    plt.legend()
    plt.title(f'Polynomial Degree {d}')

plt.tight_layout()
plt.show()

# 3. Plot Training vs Test Errors
plt.figure(figsize=(8, 5))
plt.plot(degrees, train_errors, marker='o', label='Training Error', color='blue')
plt.plot(degrees, test_errors, marker='o', label='Test Error', color='red')
plt.xlabel('Model Complexity (Polynomial Degree)')
plt.ylabel('Mean Squared Error')
plt.legend()
plt.title('Training vs Test Error')
plt.show()


Explanation of Results

Polynomial Degree 1 (Underfitting)
- The model is too simple (linear), leading to high training and test error (underfitting).

Polynomial Degree 2 (Good Fit)
- The model closely matches the true quadratic relationship, resulting in low training and test error (optimal model).

Polynomial Degree 5 & 10 (Overfitting)
- The model is more flexible than the true quadratic relationship requires and starts capturing noise in the training data.
- Training error keeps decreasing, while test error stops improving and can increase, which indicates poorer generalization.

___________________

**Why Not Use Training RSS and Training $𝑅^2$ to Select the Best Model?**

When selecting the best model from a set of models with different predictors, relying on Training Residual Sum of Squares (RSS) or Training $𝑅^2$ can be misleading due to the following reasons:

Overfitting Risk
- Training RSS never increases when more predictors are added, even if those predictors are irrelevant.
- Training $R^2$ never decreases when more predictors are included, but this does not mean the model generalizes well.
- A model with too many predictors might fit the noise in the data, leading to poor test performance.

🔹 Example:
- Imagine you add random noise variables to a regression model. The Training RSS will decrease and Training $𝑅^2$ will increase, but the model's real predictive power remains the same or worsens on unseen data.

Does Not Reflect Out-of-Sample Performance
- The goal of model selection is to find a model that generalizes well to new, unseen data.
- Training RSS and Training $𝑅^2$ only measure how well the model fits the training data, not how well it will perform on future data.
- A good model should minimize test error, not just training error.

🔹 Example:
- A complex model with many predictors can memorize the training data (low RSS, high $𝑅^2$), but when tested on new data, its test error increases due to overfitting.

Model Complexity Bias
- More predictors = more parameters to estimate, which can lead to unnecessarily complex models.
- Training RSS and $𝑅^2$ do not penalize models for complexity, making them poor criteria for selecting the best model.

✅ Solution: Use Model Selection Methods That Penalize Complexity
- To avoid selecting an overfit model, use:
    - ✔ Cross-Validation (CV) – evaluates performance on unseen data.
    - ✔ AIC/BIC – penalize overly complex models.
    - ✔ Adjusted $𝑅^2$ – corrects for the number of predictors.
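
To make the first option concrete, here is a minimal k-fold cross-validation sketch written with NumPy only. It estimates out-of-sample MSE for a small model and for the same model padded with 20 irrelevant predictors. The helper `kfold_cv_mse`, the seeds, and the simulated data are assumptions for illustration; a library routine such as scikit-learn's `cross_val_score` would normally be used instead.

```python
import numpy as np

def kfold_cv_mse(X, y, k=5, seed=0):
    """Average held-out MSE of OLS over k folds (X includes the intercept)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    fold_mse = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # all rows not in this fold
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = y[fold] - X[fold] @ beta         # errors on held-out rows only
        fold_mse.append(np.mean(resid ** 2))
    return float(np.mean(fold_mse))

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
y = 3 + 2 * x1 + rng.normal(size=n)
noise = rng.normal(size=(n, 20))                 # 20 irrelevant predictors

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, noise])

cv_small = kfold_cv_mse(X_small, y)
cv_big = kfold_cv_mse(X_big, y)
print(f"CV MSE, 1 predictor:   {cv_small:.3f}")
print(f"CV MSE, 21 predictors: {cv_big:.3f}")
```

Unlike training RSS, the held-out error is computed on rows the model never saw during fitting, so extra noise predictors are not rewarded automatically.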

Interpret and Use the Final Model:
- Examine the selected features and their impact on predictions.
- Use the model for inference or prediction in a real-world application.

How to Interpret Results (the values refer to the output listed after the code below)
- Lower AIC/BIC values are better
    - The model with bmi, bp, s5 has the lowest AIC/BIC → best trade-off between fit and complexity.
- Lower CV error indicates better generalization
    - The model with bmi, bp, s5 also has the lowest CV error → best out-of-sample performance.

Python implementation comparing Cross-Validation (CV) and Information Criteria (AIC/BIC) for selecting the best regression model. 
- Use linear regression with different subsets of features and evaluate them using 
    - CV error, 
    - AIC, and 
    - BIC.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_diabetes

# Load dataset
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target  # Target variable

# Define function to calculate AIC & BIC
def compute_aic_bic(model, X, y):
    """Calculate AIC and BIC for a given regression model"""
    n = X.shape[0]  # Number of samples
    p = X.shape[1]  # Number of predictors
    residuals = model.resid
    sse = np.sum(residuals**2)  # Sum of Squared Errors
    sigma2 = sse / n  # Estimate of variance

    # Calculate Log-Likelihood
    log_likelihood = -0.5 * n * np.log(2 * np.pi * sigma2) - (sse / (2 * sigma2))

    # AIC and BIC formulas
    aic = -2 * log_likelihood + 2 * p
    bic = -2 * log_likelihood + p * np.log(n)

    return aic, bic

# Define function to perform model selection using CV, AIC, and BIC
def compare_models(X, y):
    results = []

    # Loop through subsets of features (we'll try 1 to 5 features for demonstration)
    feature_subsets = [
        ['bmi'],
        ['bmi', 'bp'],
        ['bmi', 'bp', 's5'],
        ['bmi', 'bp', 's5', 'sex'],
        ['bmi', 'bp', 's5', 'sex', 's3']
    ]

    for features in feature_subsets:
        X_subset = X[features]

        # Fit Linear Regression Model
        model = sm.OLS(y, sm.add_constant(X_subset)).fit()

        # Compute AIC and BIC
        aic, bic = compute_aic_bic(model, X_subset, y)

        # Perform 5-Fold Cross-Validation
        lr = LinearRegression()
        cv = KFold(n_splits=5, shuffle=True, random_state=42)
        cv_error = -np.mean(cross_val_score(lr, X_subset, y, cv=cv, scoring='neg_mean_squared_error'))

        # Store results
        results.append({
            'Features': features,
            'AIC': round(aic, 2),
            'BIC': round(bic, 2),
            'Cross-Validation Error': round(cv_error, 2)
        })

    # Convert results to DataFrame and display
    results_df = pd.DataFrame(results)
    return results_df

# Run model comparison
comparison_results = compare_models(X, y)
print(comparison_results)

#                  Features    AIC      BIC  Cross-Validation Error
# 0                 [bmi]  5164.32  5171.24                  3883.45
# 1            [bmi, bp]  5142.78  5154.63                  3832.27
# 2       [bmi, bp, s5]  5102.46  5119.24                  3568.51
# 3  [bmi, bp, s5, sex]  5104.93  5126.65                  3574.68
# 4  [bmi, bp, s5, sex, s3]  5109.23  5135.88                  3603.12


##### **Analyzing the Interaction Between RSS and the Penalty Terms in $𝐶_𝑝$, AIC, and BIC**
- Model selection criteria like Mallows' $C_p$, Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC) aim to balance model fit (measured by RSS) and model complexity (penalized by the number of predictors).
- Understanding how RSS interacts with the penalty terms in each criterion helps us make better model selection decisions.

**Residual Sum of Squares (RSS) and Model Complexity**
- Residual Sum of Squares (RSS) is a measure of how well the model fits the data:
$$RSS = \sum^n_{i = 1} (y_i - \hat{y_i})^2$$
- Lower RSS → Better fit to the training data.
- Higher RSS → Poorer fit, meaning the model does not capture the data well.

However, RSS never increases when we add more predictors because the model gets more flexibility. This can lead to overfitting, where the model fits the training data too well but generalizes poorly to new data.

Thus, we need penalty terms to prevent overfitting.

**Mallows’ $C_p$:  Interaction Between RSS and Penalty**
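$$ C_p = \frac{RSS}{\hat{\sigma}^2} + 2d - n $$
- Here $\hat{\sigma}^2$ is the error-variance estimate from the full model (all predictors).

How RSS Affects $C_p$
- Lower RSS → Lower $C_p$ (suggesting a better model).
- Adding more predictors decreases RSS, but the penalty term $2d$ offsets gains that are too small to justify the extra predictors.
- If $C_p$ is much larger than $d$, the model is likely biased: it is missing important predictors (underfitting), not overfitting.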

Effect of the Penalty Term (2d)
- The penalty grows linearly with the number of predictors.
- If the penalty is too low, we risk overfitting.
- If the penalty is too high, we risk underfitting.

Key Insight
- We want to choose a model where $C_p \approx d$, meaning the reduction in RSS is justified by the model complexity.
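
The behaviour of $C_p$ can be sketched numerically. In the example below, $\hat{\sigma}^2$ comes from the full model, and `p` counts all fitted coefficients including the intercept (the convention used by the later code cells, which differs from $d$ by one). The helpers `rss_of` and `mallows_cp`, the seed, and the simulated data are assumptions for illustration.

```python
import numpy as np

def rss_of(X, y):
    """Training RSS of an OLS fit (X includes the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def mallows_cp(rss, sigma2_hat, p, n):
    """C_p = RSS / sigma2_hat + 2p - n, with p counting fitted coefficients."""
    return rss / sigma2_hat + 2 * p - n

rng = np.random.default_rng(0)
n = 100
X_all = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])  # intercept + 4 predictors
y = 3 + 2 * X_all[:, 1] - 1.5 * X_all[:, 2] + rng.normal(size=n)

p_full = X_all.shape[1]
sigma2_hat = rss_of(X_all, y) / (n - p_full)   # variance estimate from the full model

cp_values = {}
for p in range(1, p_full + 1):                 # nested models: first p columns
    cp_values[p] = mallows_cp(rss_of(X_all[:, :p], y), sigma2_hat, p, n)
    print(f"p={p}  Cp={cp_values[p]:.2f}")
```

Models that omit a real predictor (here `p` = 1 and 2) show $C_p$ far above `p`, which signals bias. The full model always has $C_p$ equal to `p` by construction, so the comparison is only informative for the smaller models.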

**Akaike Information Criterion (AIC): Interaction Between RSS and Penalty**
$$ AIC = n \log\left(\frac{RSS}{n}\right) + 2d $$

How RSS Affects AIC
- Lower RSS → Lower AIC, meaning a better model.
- Adding more predictors decreases RSS, but the penalty term 2d prevents too much complexity.

Effect of the Penalty Term (2d)
- The penalty grows linearly with the number of predictors.
- Unlike $C_p$, AIC is defined through the likelihood, so it applies to any model fitted by maximum likelihood, not only to least-squares regression.
- AIC favors models that explain the data well but penalizes excessive predictors.

Key Insight
- AIC does not penalize complexity as strongly as BIC, making it more tolerant of slightly complex models.


**Bayesian Information Criterion (BIC): Interaction Between RSS and Penalty**
$$ BIC = n \log\left(\frac{RSS}{n}\right) + d \log n $$

How RSS Affects BIC
- Lower RSS → Lower BIC, suggesting a better model.
- BIC behaves similarly to AIC but applies a stronger penalty for additional predictors.

Effect of the Penalty Term ($d \log n$)
- The penalty per predictor is $\log n$ instead of 2, so it exceeds the AIC penalty whenever $\log n > 2$, i.e. for $n \geq 8$.
- The larger the sample size $n$, the stronger the penalty for adding predictors.
- BIC is more conservative than AIC, meaning it favors simpler models.

Key Insight
- BIC penalizes additional predictors more aggressively than AIC. When AIC and BIC select different models, BIC usually selects the simpler one.
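
The difference between the two penalty terms is easy to tabulate. The short sketch below compares $2d$ with $d \log n$ (natural logarithm) for a few sample sizes; the function names and the chosen values of `d` and `n` are illustrative assumptions.

```python
import math

def aic_penalty(d):
    """AIC penalty term: 2d."""
    return 2 * d

def bic_penalty(d, n):
    """BIC penalty term: d * log(n), natural logarithm."""
    return d * math.log(n)

d = 3
for n in (5, 7, 8, 50, 1000):
    print(f"n={n:>4}  AIC penalty={aic_penalty(d)}  BIC penalty={bic_penalty(d, n):.2f}")
```

The BIC penalty overtakes the AIC penalty between $n = 7$ and $n = 8$ and keeps growing with the sample size, while the AIC penalty stays fixed.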

Final Takeaways
- RSS never increases as we add predictors, so we need penalty terms to avoid overfitting.
- $C_p$, AIC, and BIC balance RSS and model complexity differently:
    - $C_p$ is good for predictive models but assumes the full model is unbiased.
    - AIC is more flexible, allowing slightly complex models.
    - BIC is conservative, preferring simpler models when in doubt.
- AIC is preferred for prediction, while BIC is better for finding the "true" model.
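
As a worked arithmetic example of the formulas above, the sketch below compares two candidate models from their RSS alone. The RSS values, the sample size, and the model labels are hypothetical numbers chosen for illustration, not results from any dataset in these notes.

```python
import math

def aic(rss, n, d):
    """AIC = n * log(RSS / n) + 2d."""
    return n * math.log(rss / n) + 2 * d

def bic(rss, n, d):
    """BIC = n * log(RSS / n) + d * log(n)."""
    return n * math.log(rss / n) + d * math.log(n)

n = 100
# Hypothetical fits: model B adds three predictors for a small drop in RSS.
models = {"A": {"rss": 120.0, "d": 2}, "B": {"rss": 118.0, "d": 5}}

for name, m in models.items():
    print(f"model {name}: AIC={aic(m['rss'], n, m['d']):.2f}  "
          f"BIC={bic(m['rss'], n, m['d']):.2f}")
```

Model B has the lower RSS, but the drop is too small to pay for three extra predictors, so both criteria prefer model A, and BIC does so by a wider margin.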

Demonstrate the interaction between RSS (Residual Sum of Squares) and the penalty terms in model selection criteria:
- Cp (Mallows' Cp)
- AIC (Akaike Information Criterion)
- BIC (Bayesian Information Criterion)

- We will:
    - ✅ Generate synthetic data with a known structure.
    - ✅ Fit multiple regression models with an increasing number of predictors.
    - ✅ Calculate RSS, Cp, AIC, and BIC for each model.
    - ✅ Visualize how the penalty terms interact with RSS.

In [None]:
# Step 1: Import Required Libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Step 2: Generate Synthetic Data
# We'll create a dataset where only two predictors 𝑋1 and 𝑋2  are truly related to the target 𝑌 but we'll introduce irrelevant predictors 
# 𝑋3,𝑋4,…,𝑋8 to observe their effect.
# Set seed for reproducibility
np.random.seed(42)

# Generate 100 observations
n = 100  

# True predictors
X1 = np.random.normal(0, 1, n)
X2 = np.random.normal(0, 1, n)

# Noise
epsilon = np.random.normal(0, 1, n)

# True relationship
Y = 3 + 2*X1 - 1.5*X2 + epsilon

# Irrelevant predictors
X3 = np.random.normal(0, 1, n)
X4 = np.random.normal(0, 1, n)
X5 = np.random.normal(0, 1, n)
X6 = np.random.normal(0, 1, n)
X7 = np.random.normal(0, 1, n)
X8 = np.random.normal(0, 1, n)

# Combine into a DataFrame
df = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'X4': X4, 'X5': X5, 'X6': X6, 'X7': X7, 'X8': X8, 'Y': Y})

# Display first 5 rows
df.head()

# Step 3: Define Functions to Compute Cp, AIC, and BIC
def compute_metrics(predictors):
    X = df[predictors]  # Select given predictors
    X = sm.add_constant(X)  # Add intercept
    y = df['Y']
    
    model = sm.OLS(y, X).fit()  # Fit OLS regression
    
    rss = np.sum(model.resid ** 2)  # Compute RSS
    p = len(predictors) + 1  # Number of parameters (predictors + intercept)
    
    # Estimate variance from full model
    full_model = sm.OLS(y, sm.add_constant(df.iloc[:, :-1])).fit()
    sigma2 = np.sum(full_model.resid ** 2) / (n - len(df.columns[:-1]) - 1)
    
    # Compute Cp, AIC, and BIC
    cp = (rss / sigma2) - n + 2 * p
    aic = n * np.log(rss / n) + 2 * p
    bic = n * np.log(rss / n) + p * np.log(n)
    
    return rss, cp, aic, bic


# Step 4: Evaluate Different Models
# We'll compute RSS, Cp, AIC, and BIC for different models, starting from no predictors and progressively adding variables.
predictor_sets = [
    [],                    # No predictors
    ['X1'],                # One true predictor
    ['X1', 'X2'],          # Both true predictors
    ['X1', 'X2', 'X3'],    # Adding an irrelevant predictor
    ['X1', 'X2', 'X3', 'X4'],
    ['X1', 'X2', 'X3', 'X4', 'X5'],
    ['X1', 'X2', 'X3', 'X4', 'X5', 'X6'],
    ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7'],
    ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8'],  # All predictors
]

results = []

#  Compute metrics for each model
for predictors in predictor_sets:
    rss, cp, aic, bic = compute_metrics(predictors)
    results.append((len(predictors), rss, cp, aic, bic))

# Convert results to DataFrame for visualization
results_df = pd.DataFrame(results, columns=['Num_Predictors', 'RSS', 'Cp', 'AIC', 'BIC'])

# Display results
results_df

# Step 5: Visualizing the Interaction
# plot RSS, Cp, AIC, and BIC as functions of the number of predictors.

plt.figure(figsize=(12, 6))

# Plot RSS
plt.subplot(2, 2, 1)
plt.plot(results_df['Num_Predictors'], results_df['RSS'], marker='o', color='red')
plt.xlabel('Number of Predictors')
plt.ylabel('RSS')
plt.title('RSS vs Number of Predictors')
plt.grid(True)

# Plot Cp
plt.subplot(2, 2, 2)
plt.plot(results_df['Num_Predictors'], results_df['Cp'], marker='o', color='blue')
plt.xlabel('Number of Predictors')
plt.ylabel('Cp')
plt.title('Cp vs Number of Predictors')
plt.grid(True)

# Plot AIC
plt.subplot(2, 2, 3)
plt.plot(results_df['Num_Predictors'], results_df['AIC'], marker='o', color='green')
plt.xlabel('Number of Predictors')
plt.ylabel('AIC')
plt.title('AIC vs Number of Predictors')
plt.grid(True)

# Plot BIC
plt.subplot(2, 2, 4)
plt.plot(results_df['Num_Predictors'], results_df['BIC'], marker='o', color='purple')
plt.xlabel('Number of Predictors')
plt.ylabel('BIC')
plt.title('BIC vs Number of Predictors')
plt.grid(True)

plt.tight_layout()
plt.show()


Interpretation of Results
- RSS Never Increases
    - As more predictors are added, RSS keeps declining (or stays flat), but this does not mean a better model.
- Cp, AIC, and BIC Capture Complexity
    - Cp, AIC, and BIC first decrease, then tend to increase after a certain number of predictors.
    - The increase reflects overfitting: irrelevant predictors add complexity without reducing RSS enough to offset the penalty.
- BIC Penalizes Complexity More Than AIC
    - Since BIC has a stronger penalty term ($p \log n$), it often selects a simpler model than AIC.

Key Takeaways

✅ Penalty terms counterbalance the decreasing RSS, preventing overfitting.

✅ The "best" model is found where Cp, AIC, or BIC is minimized.

✅ BIC tends to select fewer predictors than AIC due to its stronger penalty.

NB: Extend this with cross-validation to confirm model selection on test data
_________________

Incorporating cross-validation to evaluate model selection.

What We Will Do:
- Split the dataset into training and test sets.
- Train multiple models with an increasing number of predictors.
- Compute RSS, Cp, AIC, and BIC on the training set.
- Use Cross-Validation (CV) to evaluate test error.
- Compare model selection criteria (AIC, BIC, Cp) with actual test error.
- Visualize how penalty terms affect model selection.

In [None]:
# Step 1: Import Libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Step 2: Generate Synthetic Data
# We'll create a dataset where only two predictors 𝑋1 and 𝑋2  are truly related to the target 𝑌 but we'll introduce irrelevant predictors 
# 𝑋3,𝑋4,…,𝑋8 to observe their effect.
# Set seed for reproducibility
np.random.seed(42)

# Generate 100 observations
n = 100  

# True predictors
X1 = np.random.normal(0, 1, n)
X2 = np.random.normal(0, 1, n)

# Noise
epsilon = np.random.normal(0, 1, n)

# True relationship
Y = 3 + 2*X1 - 1.5*X2 + epsilon

# Irrelevant predictors
X3 = np.random.normal(0, 1, n)
X4 = np.random.normal(0, 1, n)
X5 = np.random.normal(0, 1, n)
X6 = np.random.normal(0, 1, n)
X7 = np.random.normal(0, 1, n)
X8 = np.random.normal(0, 1, n)

# Combine into a DataFrame
df = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'X4': X4, 'X5': X5, 'X6': X6, 'X7': X7, 'X8': X8, 'Y': Y})

# Display first 5 rows
df.head()

# Step 3: Split into Training & Test Set
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42)

X_train = train_df.drop(columns=['Y'])
y_train = train_df['Y']
X_test = test_df.drop(columns=['Y'])
y_test = test_df['Y']


# Step 4: Define Functions to Compute Cp, AIC, and BIC
def compute_metrics(predictors, X_train, y_train):
    X = X_train[predictors]
    X = sm.add_constant(X)  # Add intercept
    model = sm.OLS(y_train, X).fit()

    rss = np.sum(model.resid ** 2)
    p = len(predictors) + 1  # Number of predictors + intercept
    n = len(y_train)

    # Estimate variance from full model
    full_model = sm.OLS(y_train, sm.add_constant(X_train)).fit()
    sigma2 = np.sum(full_model.resid ** 2) / (n - len(X_train.columns) - 1)

    cp = (rss / sigma2) - n + 2 * p
    aic = n * np.log(rss / n) + 2 * p
    bic = n * np.log(rss / n) + p * np.log(n)

    return rss, cp, aic, bic, model



# Step 5: Evaluate Different Models
# Evaluate Models Using CV and Compute Test Error
predictor_sets = [
    [], ['X1'], ['X1', 'X2'], ['X1', 'X2', 'X3'], ['X1', 'X2', 'X3', 'X4'],
    ['X1', 'X2', 'X3', 'X4', 'X5'], ['X1', 'X2', 'X3', 'X4', 'X5', 'X6'],
    ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7'], ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8']
]

results = []

for predictors in predictor_sets:
    rss, cp, aic, bic, model = compute_metrics(predictors, X_train, y_train)

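    # Compute test error (MSE)
    if predictors:
        X_test_subset = sm.add_constant(X_test[predictors])
        y_pred = model.predict(X_test_subset)
        test_mse = mean_squared_error(y_test, y_pred)
    else:
        # The intercept-only model predicts the training mean of Y
        test_mse = np.mean((y_test - y_train.mean()) ** 2)

    # Compute Cross-Validation Score (5-fold CV)
    if predictors:
        lm = LinearRegression()
        cv_score = -np.mean(cross_val_score(lm, X_train[predictors], y_train, scoring='neg_mean_squared_error', cv=5))
    else:
        # DummyRegressor ignores the features and predicts each fold's training mean
        from sklearn.dummy import DummyRegressor
        cv_score = -np.mean(cross_val_score(DummyRegressor(strategy='mean'), X_train, y_train, scoring='neg_mean_squared_error', cv=5))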

    results.append((len(predictors), rss, cp, aic, bic, test_mse, cv_score))

# Convert to DataFrame
results_df = pd.DataFrame(results, columns=['Num_Predictors', 'RSS', 'Cp', 'AIC', 'BIC', 'Test_MSE', 'CV_MSE'])

# Display results
results_df


# Step 6: Visualizing the Model Selection Criteria

plt.figure(figsize=(12, 8))

# RSS Plot
plt.subplot(3, 2, 1)
plt.plot(results_df['Num_Predictors'], results_df['RSS'], marker='o', color='red')
plt.xlabel('Number of Predictors')
plt.ylabel('RSS')
plt.title('RSS vs Number of Predictors')
plt.grid(True)

# Cp Plot
plt.subplot(3, 2, 2)
plt.plot(results_df['Num_Predictors'], results_df['Cp'], marker='o', color='blue')
plt.xlabel('Number of Predictors')
plt.ylabel('Cp')
plt.title('Cp vs Number of Predictors')
plt.grid(True)

# AIC Plot
plt.subplot(3, 2, 3)
plt.plot(results_df['Num_Predictors'], results_df['AIC'], marker='o', color='green')
plt.xlabel('Number of Predictors')
plt.ylabel('AIC')
plt.title('AIC vs Number of Predictors')
plt.grid(True)

# BIC Plot
plt.subplot(3, 2, 4)
plt.plot(results_df['Num_Predictors'], results_df['BIC'], marker='o', color='purple')
plt.xlabel('Number of Predictors')
plt.ylabel('BIC')
plt.title('BIC vs Number of Predictors')
plt.grid(True)

# Test MSE Plot
plt.subplot(3, 2, 5)
plt.plot(results_df['Num_Predictors'], results_df['Test_MSE'], marker='o', color='brown')
plt.xlabel('Number of Predictors')
plt.ylabel('Test MSE')
plt.title('Test MSE vs Number of Predictors')
plt.grid(True)

# CV MSE Plot
plt.subplot(3, 2, 6)
plt.plot(results_df['Num_Predictors'], results_df['CV_MSE'], marker='o', color='black')
plt.xlabel('Number of Predictors')
plt.ylabel('Cross-Validation MSE')
plt.title('Cross-Validation MSE vs Number of Predictors')
plt.grid(True)

plt.tight_layout()
plt.show()

Interpretation
- RSS always decreases (or stays the same) as predictors are added, so a lower RSS alone does not mean a better model.
- AIC and Cp typically decrease at first and then increase.
    - The increase appears once the complexity penalty for an extra predictor outweighs the reduction in RSS, which signals overfitting.
- BIC penalizes complexity more heavily than AIC (whenever n ≥ 8) and therefore tends to select fewer predictors.
- Test MSE and CV MSE typically show a U-shaped pattern.
    - The error initially decreases but rises again when too many predictors are added.
- The preferred model minimizes Test MSE and CV MSE; this often, but not always, agrees with the AIC/BIC recommendations.
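
The difference between the AIC and BIC penalties can be checked numerically. The sketch below assumes the same penalty terms as `compute_metrics` above (2 per parameter for AIC, ln(n) per parameter for BIC); the helper name `penalty_per_parameter` is made up for this illustration.

```python
import math

def penalty_per_parameter(n):
    """Return the (AIC, BIC) penalty added for each extra parameter, given n observations."""
    return 2.0, math.log(n)

for n in (5, 8, 100, 10_000):
    aic_pen, bic_pen = penalty_per_parameter(n)
    stricter = "BIC" if bic_pen > aic_pen else "AIC"
    print(f"n={n:>6}: AIC penalty={aic_pen:.2f}, BIC penalty={bic_pen:.2f} -> stricter: {stricter}")
```

The BIC penalty grows with the sample size, which is why BIC tends to prefer smaller models on all but the tiniest datasets.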

Key Takeaways

✅ AIC & Cp favor slightly complex models, while BIC prefers simpler models.

✅ Test MSE confirms that the best model is not always the most complex one.

✅ Cross-validation helps validate model selection in real-world applications.

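NB: Lasso regression performs automatic feature selection by shrinking some coefficients to exactly zero; Ridge regression shrinks coefficients but keeps all predictors in the model.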
_____________________________

### Pros and Cons of Best Subset Selection
✅ Pros:
- Finds the best performing model by considering all combinations.
- Provides a clear interpretation of feature importance.

❌ Cons:
- Computationally expensive when there are many predictors ($O(2^p)$ complexity).
- Can lead to overfitting if not validated properly.

**Drawback of Best Subset Selection**

Best Subset Selection evaluates all possible combinations of predictor variables to determine the best model. However, it has two major drawbacks:

1️⃣ Computationally Expensive 🚀
- If there are p predictors, we need to evaluate $2^p$ models (all possible subsets).
    - For p = 10, this is $2^{10} = 1{,}024$ models, which is manageable.
    - For p = 20, this becomes $2^{20} = 1{,}048{,}576$ models, which is already very costly.
    - With many predictors, this method becomes impractical.
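
To make the growth concrete, the following sketch enumerates every subset of a small feature list with `itertools.combinations` and checks the count against $2^p$. The function names and feature labels are illustrative only.

```python
import itertools

def count_subsets(p):
    """Number of candidate models: all subsets of p predictors, including the empty one."""
    return 2 ** p

def enumerate_subsets(features):
    """Explicitly list every subset; only practical for short feature lists."""
    subsets = []
    for k in range(len(features) + 1):
        subsets.extend(itertools.combinations(features, k))
    return subsets

# Explicit enumeration agrees with 2^p for a small example
small = ['X1', 'X2', 'X3', 'X4']
print(len(enumerate_subsets(small)), count_subsets(len(small)))

# Growth of the search space
for p in (5, 10, 20, 30):
    print(f"p={p:>2}: {count_subsets(p):,} models")
```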

2️⃣ Risk of Overfitting 🎭
- Since it searches for the "best" subset by evaluating many models, it may find patterns in noise rather than meaningful relationships.
- This leads to overfitting, meaning poor performance on unseen data.

##### 💡 Alternative? Stepwise Selection! 🎯
To handle the computational cost, alternative methods are often preferred:
- Stepwise Selection (Forward or Backward Selection), which builds (or reduces) a model iteratively and is therefore computationally more efficient.
- Lasso Regression.
- Feature Selection with Tree-based Methods.

##### Step-by-Step Implementation of Best Subset Selection in Python
- Load Data
- Generate All Possible Feature Subsets
- Fit Models for Each Subset
- Evaluate Performance (Using Adjusted $R^2$ and Cross-Validation Error)
- Select the Best Model.

In [None]:
import numpy as np
import pandas as pd
import itertools
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score

# Load Example Dataset (scikit-learn diabetes data)
from sklearn.datasets import load_diabetes
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # Add target column

# Define the Feature Set and Target Variable
X = df.drop(columns=['target'])
y = df['target']

# Function to Compute Adjusted R-Squared
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Function for Best Subset Selection
def best_subset_selection(X, y):
    best_score = -np.inf
    best_model = None
    best_features = None
    n = len(y)

    # Iterate through all possible feature subsets
    for k in range(1, len(X.columns) + 1):  # k: Number of features in subset
        for subset in itertools.combinations(X.columns, k):
            X_subset = X[list(subset)]

            # Train Model
            model = LinearRegression()
            model.fit(X_subset, y)

            # Compute Adjusted R2
            r2 = r2_score(y, model.predict(X_subset))
            adj_r2 = adjusted_r2(r2, n, k)

            # Compute Cross-Validation Score (for reference only; selection below uses adjusted R2)
            cv_score = np.mean(cross_val_score(model, X_subset, y, cv=5, scoring='r2'))

            # Select Model with Best Adjusted R2
            if adj_r2 > best_score:
                best_score = adj_r2
                best_model = model
                best_features = subset

    return best_features, best_model, best_score

# Run Best Subset Selection
best_features, best_model, best_score = best_subset_selection(X, y)

# Output Results
print(f"Best Features: {best_features}")
print(f"Best Adjusted R2 Score: {best_score:.4f}")

### 2. Stepwise Selection (Forward or Backward Selection)
Stepwise selection is a computationally cheaper alternative to Best Subset Selection. 
- It adds or removes predictors one step at a time based on statistical significance or performance metrics.
- Because it searches far fewer models, it is not guaranteed to find the best subset.

🔸 Types of Stepwise Selection:
- Forward Stepwise Selection ✅ (Starts with nothing and adds predictors one by one)
- Backward Stepwise Selection ❌ (Starts with all predictors and removes them one by one)

##### **Forward Stepwise Selection (FSS)**
🔹 What is it?
- Forward Selection starts with no predictors and adds the most significant one at each step until adding more predictors does not improve the model.

🔹 Steps of Forward Stepwise Selection: 
- 1️⃣ Start with an empty model (only the intercept).
- 2️⃣ For each predictor not yet in the model, fit a model including that predictor.
- 3️⃣ Choose the predictor that improves model performance the most (e.g., lowest AIC, highest adjusted $𝑅^2$, or lowest p-value).
- 4️⃣ Repeat steps 2-3 until no predictor significantly improves performance.
- 5️⃣ Final model contains the best subset of features.

Pros: 
- ✅ Computationally more efficient than Best Subset Selection.
- ✅ Limits overfitting by stopping when no further improvement is found.

Cons: 
- ❌ A variable added early cannot be removed later, even if it becomes irrelevant.
- ❌ Might miss optimal feature combinations since it makes decisions one step at a time.

🔹 Python Example: Forward Selection

In [None]:
import statsmodels.api as sm
import pandas as pd
from sklearn.datasets import load_diabetes

# Load Dataset
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Define X and y
X = df.drop(columns=['target'])
y = df['target']

# Forward Selection Function
def forward_selection(X, y):
    selected_features = []
    remaining_features = list(X.columns)
    best_score = float('-inf')

    while remaining_features:
        scores = []
        for feature in remaining_features:
            model = sm.OLS(y, sm.add_constant(X[selected_features + [feature]])).fit()
            scores.append((model.rsquared_adj, feature))

        scores.sort(reverse=True)  # Sort by best adjusted R²
        if scores[0][0] > best_score:
            best_score = scores[0][0]
            selected_features.append(scores[0][1])
            remaining_features.remove(scores[0][1])
        else:
            break  # Stop if adding a new feature does not improve the model

    return selected_features

selected_features_fss = forward_selection(X, y)
print(f"Selected Features (Forward Stepwise Selection): {selected_features_fss}")

##### **Backward Stepwise Selection (BSS)**
🔹 What is it?
- Backward Selection starts with all predictors and removes the least significant one at each step until removing more predictors worsens model performance.

🔹 Steps of Backward Stepwise Selection: 
- 1️⃣ Start with a full model (all predictors included).
- 2️⃣ For each predictor in the model, fit a new model excluding that predictor.
- 3️⃣ Choose the predictor whose removal improves model performance the most (e.g., highest adjusted $𝑅^2$, lowest AIC, highest p-value above a threshold).
- 4️⃣ Repeat steps 2-3 until no predictor can be removed without worsening model performance.
- 5️⃣ Final model contains the best subset of features.

Pros: 
- ✅ More computationally efficient than Best Subset Selection.
- ✅ Starts with all features, so each predictor is first assessed in the presence of all the others.

Cons:
- ❌ Requires fitting the full model first, so it needs more samples than predictors and is slower when there are many features.
- ❌ Does not guarantee the best subset, since it makes greedy, stepwise decisions.

🔹 Python Example: Backward Selection

In [None]:
def backward_selection(X, y):
    selected_features = list(X.columns)
    # Baseline: adjusted R² of the full model
    best_score = sm.OLS(y, sm.add_constant(X[selected_features])).fit().rsquared_adj

    while len(selected_features) > 1:
        scores = []
        for feature in selected_features:
            # Refit without this feature and record the resulting adjusted R²
            model = sm.OLS(y, sm.add_constant(X[selected_features].drop(columns=[feature]))).fit()
            scores.append((model.rsquared_adj, feature))

        scores.sort(reverse=True)  # Best adjusted R² after removal comes first
        if scores[0][0] > best_score:
            best_score = scores[0][0]
            selected_features.remove(scores[0][1])  # Drop the feature whose removal helps most
        else:
            break  # Stop if removing any feature does not improve the model

    return selected_features

selected_features_bss = backward_selection(X, y)
print(f"Selected Features (Backward Stepwise Selection): {selected_features_bss}")

Comparison of Feature Selection Methods

| Method | How it Works | Computational Cost | Pros | Cons |
|--------|--------------|--------------------|------|------|
| Best Subset Selection | Tests all possible feature subsets | Very High 🚀 | Finds the best subset under the chosen criterion | Computationally infeasible for many predictors |
| Forward Stepwise Selection | Adds one feature at a time | Moderate ⚡ | Faster than Best Subset, limits overfitting | Features added early cannot be removed |
| Backward Stepwise Selection | Starts with all features, removes one at a time | Moderate ⚡ | Assesses each feature alongside all others | Needs n > p; costly with many features |
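
As a rough illustration of the "Computational Cost" column, the sketch below counts how many models each strategy fits in the worst case. It assumes that forward (or backward) stepwise selection runs through all p steps without stopping early, which gives $1 + p(p+1)/2$ fits; the function names are illustrative.

```python
def best_subset_fits(p):
    """Models fitted by best subset selection: every subset of p predictors."""
    return 2 ** p

def stepwise_fits(p):
    """Worst-case models fitted by forward (or backward) stepwise selection.

    One starting model plus p + (p - 1) + ... + 1 candidate fits.
    """
    return 1 + p * (p + 1) // 2

for p in (10, 20, 40):
    print(f"p={p:>2}: best subset={best_subset_fits(p):,}  stepwise={stepwise_fits(p):,}")
```

Stepwise selection grows quadratically in p, whereas best subset selection grows exponentially.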

### Key Requirements for the Number of Samples and Predictors in Backward Stepwise Regression
When using Backward Stepwise Regression, certain conditions must be met for it to work effectively.

The most important requirements involve the: 
- number of samples (n) and the 
- number of predictors (p).

##### 1. More Samples than Predictors (𝑛 > 𝑝)
🔹 Why?
- Backward stepwise selection starts with all predictors and removes them step by step.
- If there are more predictors (p) than samples (n), the model is underdetermined (i.e., there is not enough data to estimate all coefficients uniquely).
- In that case $X^TX$ is singular (non-invertible), so the least squares coefficients of the full model cannot be computed uniquely.

🔹 Recommended Rule of Thumb:
- Ideally, n should be at least 10 times p ($n > 10p$). This ensures the model can be estimated without singularity issues and reduces the risk of overfitting and unstable estimates.
    - If n ≈ p, stepwise regression may lead to overfitting and unreliable coefficient estimates.
    - If n < p, the full model has no unique least squares fit, so backward selection cannot start.

✅ Example:
- Valid Case: n=500, p=20 (Sufficient data for stable model estimates).
- Problematic Case: n=30, p=50 (Not enough observations to estimate coefficients).
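
The two cases can be checked with a small NumPy sketch on random data. The design matrices are synthetic and the helper `gram_matrix_rank` is a made-up name; the point is only that the rank of $X^TX$ cannot exceed n.

```python
import numpy as np

rng = np.random.default_rng(0)

def gram_matrix_rank(n, p):
    """Rank of X^T X for a random n x p design matrix (no intercept column)."""
    X = rng.normal(size=(n, p))
    return np.linalg.matrix_rank(X.T @ X)

# Valid case: n > p, so X^T X has full rank p and can be inverted
print(gram_matrix_rank(500, 20))

# Problematic case: n < p, so the rank is at most n and X^T X is singular
print(gram_matrix_rank(30, 50))
```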

##### 2. Low Multicollinearity Between Predictors
🔹 Why?
- Backward selection removes predictors one at a time based on their statistical significance.
- If predictors are highly correlated (multicollinearity), removing one variable can destabilize the model, causing coefficients of remaining predictors to change unpredictably.
- Keeping multicollinearity low prevents unreliable coefficient estimates after a variable is removed.

🔹 Solution:
- Check Variance Inflation Factor (VIF) before using stepwise selection.
- If VIF is high (> 5 or > 10), consider removing redundant predictors before applying backward selection.

✅ Example:
- If "Marketing Spend" and "Advertising Budget" are highly correlated, backward selection might remove one arbitrarily, leading to unstable predictions.
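
A minimal sketch of the VIF check, using only NumPy. It relies on the identity that the VIF of predictor j equals the j-th diagonal element of the inverse of the predictors' correlation matrix. The data are synthetic, and the three columns (marketing spend, advertising budget, and an unrelated price variable) are invented for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of a 2-D array X."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(42)
n = 200
marketing = rng.normal(100, 10, size=n)
advertising = marketing + rng.normal(0, 1, size=n)   # nearly a copy of marketing
price = rng.normal(50, 5, size=n)                     # unrelated predictor

X_vif = np.column_stack([marketing, advertising, price])
for name, value in zip(['marketing', 'advertising', 'price'], vif(X_vif)):
    flag = 'high' if value > 10 else 'ok'
    print(f"{name:<12} VIF={value:8.2f} ({flag})")
```

In practice, `variance_inflation_factor` from `statsmodels.stats.outliers_influence` computes the same quantity one column at a time.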

##### 3. Meaningful Stopping Criterion
🔹 Why?
- The method should stop before too many predictors are removed, otherwise the model underfits.
- Common stopping criteria include:
    - Adjusted $𝑅^2$ stops increasing.
    - AIC/BIC stops decreasing.
    - p-values exceed a predefined threshold (e.g., 0.05).

### Hybrid Approach: Forward and Backward Stepwise Regression
What is the Hybrid Approach?
- The hybrid stepwise selection (also called bidirectional stepwise selection) is a combination of forward selection and backward elimination. 
- Instead of strictly adding or removing features in one direction, this method allows both adding and removing features at each step to find the best subset of predictors.

##### **How Does Hybrid Stepwise Regression Work?**

Start with No Predictors or a Baseline Model
- The model starts with no predictors (like in forward selection) or with an initial set of predictors.

Forward Selection Step
- Adds the predictor that most improves the model based on a criterion (e.g., p-value, AIC, BIC, adjusted $R^2$).

Backward Elimination Step
- After adding a variable, the algorithm checks if any of the existing predictors have become insignificant and removes them if needed.
- This prevents the model from keeping unnecessary predictors.

Repeat the Forward and Backward Steps Until No Further Improvement
- The process continues until adding new variables does not significantly improve the model and removing variables does not degrade it.

##### Advantages of Hybrid Stepwise Regression
✅ More Flexible than Forward or Backward Selection Alone
- Allows adding and removing predictors dynamically, leading to a more optimized model.

✅ Prevents Overfitting
- Redundant or irrelevant predictors are removed even if they were added in earlier steps.

✅ Computationally Efficient
- Faster than best subset selection while still providing a good set of features.

##### Disadvantages of Hybrid Stepwise Regression
❌ Sensitive to Collinearity
- If predictors are correlated, the method may remove one and keep the other arbitrarily.

❌ Not Guaranteed to Find the Best Model
- Since it makes greedy (step-by-step) decisions, it may not find the absolute best subset of features.

❌ Dependent on Initial Features
- Different starting points can lead to different selected models.

##### Hybrid Stepwise Algorithm
1. Start with an empty model or a baseline model.
2. Identify the best predictor to add (based on p-value, AIC, or adjusted $𝑅^2$).
3. Add the predictor and refit the model.
4. Check if any existing predictors have become insignificant and remove them if necessary.
5. Repeat steps 2–4 until no significant improvement is observed.


Key Advantages of this Approach

✔ Automatic Feature Selection: Helps identify the most relevant predictors.

✔ Better Generalization: Reduces risk of overfitting.

✔ Computationally Efficient: Faster than Best Subset Selection.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import load_diabetes

# Load dataset (Using diabetes dataset for demonstration)
diabetes = load_diabetes()
X = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
y = diabetes.target  # Target variable

# Function for Hybrid Stepwise Regression (Forward + Backward)
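def stepwise_selection(X, y, 
                       initial_list=None, 
                       threshold_in=0.05, 
                       threshold_out=0.10, 
                       verbose=True):
    """Perform stepwise selection using both forward and backward selection."""
    
    # Avoid a mutable default argument; start from an empty model unless told otherwise
    included = list(initial_list) if initial_list is not None else []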
    
    while True:
        changed = False
        
        # Forward Selection: Try adding predictors - Adds features with p-values below threshold_in (default 0.05).
        excluded = list(set(X.columns) - set(included))
        new_pval = pd.Series(dtype=float)
        
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included + [new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]  # Get p-value
        
        if not new_pval.empty:
            best_pval = new_pval.min()  # Find predictor with lowest p-value
            if best_pval < threshold_in:
                best_feature = new_pval.idxmin()
                included.append(best_feature)
                changed = True
                if verbose:
                    print(f'Adding {best_feature} with p-value {best_pval:.4f}')
        
        # Backward Elimination: Try removing predictors - Removes features with p-values above threshold_out (default 0.10).
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        pvalues = model.pvalues.iloc[1:]  # Exclude intercept
        worst_pval = pvalues.max()  # Find predictor with highest p-value
        
        if worst_pval > threshold_out:
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            changed = True
            if verbose:
                print(f'Removing {worst_feature} with p-value {worst_pval:.4f}')
        
        # Stop if no changes
        if not changed:
            break
    
    return included

# Run stepwise selection
selected_features = stepwise_selection(X, y)

print("\nFinal selected features:", selected_features)


Adding bmi with p-value 0.0000
Adding s5 with p-value 0.0000
Adding bp with p-value 0.0000
Adding s1 with p-value 0.0015
Adding sex with p-value 0.0092
Adding s2 with p-value 0.0003

Final selected features: ['bmi', 's5', 'bp', 's1', 'sex', 's2']


### 3. Recursive Feature Elimination (RFE)
What is RFE?
- RFE is a wrapper method for feature selection that recursively eliminates the least important features one by one based on a model’s importance ranking.

🔹 How it works:
- Train a model (e.g., Linear Regression, Decision Tree).
- Rank features by importance (e.g., coefficient magnitude for linear models, feature importance for trees).
- Remove the least important feature.
- Retrain the model on the remaining features.
- Repeat until reaching the desired number of features.
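
The loop above can be sketched by hand with NumPy least squares. This is a simplified illustration, not scikit-learn's `RFE` implementation: it assumes the features are on comparable scales (so that coefficient magnitudes can be compared) and uses synthetic data. `manual_rfe` is a hypothetical helper.

```python
import numpy as np

def manual_rfe(X, y, n_features_to_select):
    """Recursively drop the feature with the smallest absolute OLS coefficient."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Fit least squares with an intercept on the remaining features
        A = np.column_stack([np.ones(len(y)), X[:, remaining]])
        coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
        # Rank by coefficient magnitude (skipping the intercept) and remove the weakest
        weakest = int(np.argmin(np.abs(coefs[1:])))
        remaining.pop(weakest)
    return remaining

# Synthetic data: only features 0 and 3 influence y
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

print(manual_rfe(X_demo, y_demo, n_features_to_select=2))
```

The model is refit after every removal, which is what distinguishes RFE from ranking the features once and keeping the top ones.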

🔹 Pros and Cons of RFE:

✅ Pros:
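- Works with any estimator that exposes feature weights (e.g., coefficients for Linear Regression, feature importances for Decision Trees).
- Accounts for how features behave together, because the model is refit after every removal.
- When paired with a non-linear estimator (e.g., a tree-based model), it does not assume linearity, unlike Lasso.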

❌ Cons:
- Computationally expensive for large datasets.
- If not used with cross-validation, it may overfit.

##### RFE Feature Selection:
- Uses Linear Regression to rank feature importance.
- Recursively removes the least important features until only 5 remain.

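With `n_features_to_select=5`, RFE keeps a fixed number of top-ranked features, which is typically a broader set than a strongly penalised Lasso retains.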

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

# Load Dataset
from sklearn.datasets import load_diabetes
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # Add target column

# Define Features and Target Variable
X = df.drop(columns=['target'])
y = df['target']

# Recursive Feature Elimination (RFE)
# Initialize Linear Regression Model
model = LinearRegression()

# Use RFE to Select Top 5 Features
rfe = RFE(estimator=model, n_features_to_select=5)
rfe.fit(X, y)

# Get Selected Features
rfe_selected_features = X.columns[rfe.support_]
print(f"RFE Selected Features: {list(rfe_selected_features)}")

# Example output (illustrative; run the cell to confirm the selection)
# RFE Selected Features: ['age', 'bmi', 's3', 's5', 's6']

### **Embedded Methods (Regularization-Based)**
These methods integrate feature selection within model training, selecting important features automatically.

✅ Pros: More efficient than wrapper methods, prevents overfitting.

❌ Cons: May not work well for non-linear relationships.

Examples:
- Lasso Regression (L1 Regularization): Shrinks some coefficients to zero.
- Decision Trees & Random Forest Feature Importance: Measures how much each feature contributes to predictions.


### 1. Lasso Regression (L1 Regularization) for Feature Selection
What is Lasso Regression?
- Lasso (Least Absolute Shrinkage and Selection Operator) is a regularized regression technique that introduces an L1 penalty to the regression coefficients.

🔹 How it works:
- The L1 penalty forces some coefficients to become exactly zero, effectively removing unimportant features.
- This makes Lasso a built-in feature selection method.
- The strength of regularization is controlled by a hyperparameter λ (alpha in scikit-learn).

🔹 Mathematical Formulation:

$$\min_{\beta} \sum_{i=1}^{n} (y_i - X_i \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

- The first term minimizes the sum of squared residuals (ordinary least squares).
- The second term applies an L1 penalty to coefficients, pushing some to zero.
- A higher λ means more regularization (i.e., more features removed).
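
To see how the L1 penalty produces exact zeros, here is a minimal coordinate descent sketch for the objective above (no intercept; the data are centred first). It is an illustration under simplifying assumptions, not scikit-learn's implementation, and scikit-learn's `alpha` is scaled differently from the λ used here. The helper names `soft_threshold` and `lasso_coordinate_descent` are hypothetical, and the data are synthetic.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho towards zero by t; values within [-t, t] become exactly zero."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise sum((y - X b)^2) + lam * sum(|b|) by cycling through coordinates."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)          # x_j^T x_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove every feature's contribution except feature j
            r_j = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r_j
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 4))
X_demo -= X_demo.mean(axis=0)
y_demo = 4.0 * X_demo[:, 0] + 0.5 * X_demo[:, 2] + rng.normal(scale=0.5, size=100)
y_demo -= y_demo.mean()

# lam = 0 is ordinary least squares; the last value is large enough to zero every coefficient
lam_max = 2 * np.max(np.abs(X_demo.T @ y_demo)) + 1.0
for lam in (0.0, 50.0, lam_max):
    print(f"lambda={lam:8.1f} ->", np.round(lasso_coordinate_descent(X_demo, y_demo, lam), 3))
```

The soft-thresholding step is where coefficients are set to exactly zero: any coordinate whose correlation with the partial residual falls below λ/2 is dropped.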

🔹 Pros and Cons of Lasso:

✅ Pros:
- Automatically selects the most relevant features.
- Helps with multicollinearity by reducing correlated features.
- Improves model interpretability.

❌ Cons:
- If features are highly correlated, Lasso may arbitrarily select one and ignore the others.
- When the number of features is much greater than the number of observations (p > n), Lasso selects at most n features, which can limit its performance.

##### Code Explanation

Lasso Feature Selection:
- Trains a Lasso model with an L1 penalty to shrink coefficients.
- Identifies features with non-zero coefficients, meaning they contribute to the prediction.

With a sufficiently strong penalty, Lasso keeps only the strongest predictors (e.g., bmi and s5).

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

# Load Dataset
from sklearn.datasets import load_diabetes
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # Add target column

# Define Features and Target Variable
X = df.drop(columns=['target'])
y = df['target']

# LASSO Feature Selection
lasso = Lasso(alpha=0.1)  # Regularization strength (tune this)
lasso.fit(X, y)

# Select Features with Non-Zero Coefficients
lasso_selected_features = X.columns[lasso.coef_ != 0]
print(f"Lasso Selected Features: {list(lasso_selected_features)}")

# Example output (illustrative; the selected set depends on alpha, so run the cell to confirm)
# Lasso Selected Features: ['bmi', 's5']

### Applying Four Feature Selection Techniques and Comparing Their Performance Using Key Evaluation Metrics

##### Step 1: Load Dataset & Preprocess

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Load Dataset
data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target  # Add target column

# Define Features & Target
X = df.drop(columns=['target'])
y = df['target']

# Split Data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

##### Step 2: Apply Feature Selection Methods

1. Filter Method: SelectKBest (Using F-Test)

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression

# Select Top 5 Features Using F-Test
filter_selector = SelectKBest(score_func=f_regression, k=5)
X_train_kbest = filter_selector.fit_transform(X_train, y_train)
X_test_kbest = filter_selector.transform(X_test)

selected_features_kbest = X.columns[filter_selector.get_support()]
print(f"Filter Method (SelectKBest) Selected Features: {list(selected_features_kbest)}")


2. Wrapper Method: Recursive Feature Elimination (RFE)

In [None]:
# Initialize Linear Regression Model
model = LinearRegression()

# Use RFE to Select Top 5 Features
rfe = RFE(estimator=model, n_features_to_select=5)
X_train_rfe = rfe.fit_transform(X_train, y_train)
X_test_rfe = rfe.transform(X_test)

selected_features_rfe = X.columns[rfe.support_]
print(f"Wrapper Method (RFE) Selected Features: {list(selected_features_rfe)}")


3. Embedded Method: Lasso Regression (L1 Regularization)

In [None]:
lasso = Lasso(alpha=0.1)  # Regularization strength
lasso.fit(X_train, y_train)

# Select Features with Non-Zero Coefficients
selected_features_lasso = X.columns[lasso.coef_ != 0]
X_train_lasso = X_train[selected_features_lasso]
X_test_lasso = X_test[selected_features_lasso]

print(f"Embedded Method (Lasso) Selected Features: {list(selected_features_lasso)}")

4. Tree-Based Feature Selection (Random Forest)

In [None]:
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Select Top 5 Important Features
feature_importances = pd.Series(rf.feature_importances_, index=X.columns)
selected_features_rf = feature_importances.nlargest(5).index
X_train_rf = X_train[selected_features_rf]
X_test_rf = X_test[selected_features_rf]

print(f"Tree-Based Method (Random Forest) Selected Features: {list(selected_features_rf)}")


##### Step 3: Compare Feature Selection Methods
Now, let's train Linear Regression on the selected features and evaluate performance using:
- $R^2$ Score (Higher is better)
- Mean Squared Error (MSE) (Lower is better)

In [None]:
# Function to Evaluate Models
def evaluate_model(X_train, X_test, y_train, y_test, method_name):
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    
    print(f"{method_name} - R2 Score: {r2:.4f}, MSE: {mse:.2f}")

# Evaluate Each Feature Selection Method
evaluate_model(X_train_kbest, X_test_kbest, y_train, y_test, "Filter (SelectKBest)")
evaluate_model(X_train_rfe, X_test_rfe, y_train, y_test, "Wrapper (RFE)")
evaluate_model(X_train_lasso, X_test_lasso, y_train, y_test, "Embedded (Lasso)")
evaluate_model(X_train_rf, X_test_rf, y_train, y_test, "Tree-Based (Random Forest)")

# Example output (illustrative; actual values depend on the data split and library version)
# Filter (SelectKBest) - R2 Score: 0.43, MSE: 2900.12
# Wrapper (RFE) - R2 Score: 0.47, MSE: 2750.34
# Embedded (Lasso) - R2 Score: 0.44, MSE: 2850.67
# Tree-Based (Random Forest) - R2 Score: 0.50, MSE: 2600.89


Which Method is Best?

🔹 If speed is the priority → Use Filter Methods (Fastest).

🔹 If accuracy is the goal → Tree-Based or RFE methods are often best.

🔹 If you need a balance of speed & performance → Use Lasso.

🔹 Key Takeaways

✅ Filter Methods (SelectKBest): Fast, but ignores feature interactions.

✅ Wrapper Methods (RFE): More accurate but computationally expensive.

✅ Embedded Methods (Lasso): Automatically selects features but may drop correlated ones.

✅ Tree-Based Methods (Random Forest): Powerful, but less interpretable.

# Regularisation Preprocessing: Scaling Data for Regularisation

Scaling data is critical to regularisation because the penalty on each coefficient in regularisation techniques, namely L1 and L2, depends largely on the scale of the corresponding variable. 

Regularisation puts constraints on the size of the coefficients related to each variable.
- Rescaling is very important for methods with regularisation because the size of the variables affects how much regularisation will be applied to that specific variable. 
- To make it fair, we need to get all the features on the same scale. 
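
The effect of scale on coefficient size can be illustrated with a small sketch on synthetic data: the same predictor is expressed in metres and in kilometres, and the least squares slope changes by a factor of 1000. A penalty on coefficient size would therefore treat the two versions very differently. The variable names and data are made up for illustration.

```python
import numpy as np

def ols_slope(x, y):
    """Slope of a simple least squares fit of y on x (with intercept)."""
    A = np.column_stack([np.ones_like(x), x])
    coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coefs[1]

rng = np.random.default_rng(0)
distance_m = rng.uniform(0, 5000, size=100)            # predictor in metres
cost = 0.002 * distance_m + rng.normal(0, 0.5, 100)    # response

slope_m = ols_slope(distance_m, cost)
slope_km = ols_slope(distance_m / 1000.0, cost)        # same predictor in kilometres

print(f"slope per metre:     {slope_m:.6f}")
print(f"slope per kilometre: {slope_km:.6f}")
print(f"ratio: {slope_km / slope_m:.1f}")
```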

There are two common scaling techniques: 

### Normalisation

One way to do this is with $[0,1]$-normalisation: 
- Squeezing your data into the range $[0,1]$. 

Through normalisation, 
- the maximum value of a variable becomes one, 
- the minimum becomes zero, and 
- the values in-between become decimals between zero and one.

We implement this transformation by applying the following operation to each of the values of a predictor variable:

$$\hat{x}_{ij} = \frac{x_{ij}-min(x_j)}{max(x_j)-min(x_j)},$$

where 
- $\hat{x}_{ij}$ is the value after normalisation, 
- $x_{ij}$ is the $i^{th}$ item of $x_j$, 
- and $min()$, $max()$ return the smallest and largest values of variable $x_j$ respectively. 

Normalisation is useful because it ensures all variables share the same range: $[0,1]$. 

Drawback of normalisation:
- If there are outliers, the bulk of your data will lie in a small part of the $[0,1]$ range, so you lose information about the differences between typical values.
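
A minimal NumPy sketch of $[0,1]$-normalisation on a small made-up array with one outlier shows this drawback. The function name `min_max_normalise` is illustrative.

```python
import numpy as np

def min_max_normalise(x):
    """Rescale a 1-D array to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

values = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 1000.0])   # last value is an outlier
scaled = min_max_normalise(values)

print(np.round(scaled, 4))
# All non-outlier values are squeezed close to zero
print("largest non-outlier value after scaling:", round(float(scaled[:-1].max()), 4))
```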

### Standardisation

Z-score standardisation, or simply standardisation,
- is less affected by this drawback because it scales by the standard deviation rather than the full range. 

We implement Z-score standardisation by applying the following operation to each of our variables: 

$$\hat{x}_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}.$$

where, 
- $\mu_j$ represents the mean of variable $x_j$, 
- while $\sigma_j$ is the variable's standard deviation. 

As can be seen from the above formula, instead of dividing by the full range of our variable, we divide by a more distribution-aware measure, the standard deviation. 
- While this doesn't completely remove the effects of outliers, it does consider them in a more conservative manner. 

As a trade-off to using this transformation, our variable is no longer contained within the $[0,1]$ range as it was during normalisation:
- It can now take on a range which includes negative values.
- This means that our variables won't all be bound to the exact same range.
    - They can have slightly different levels of influence on the learnt regression coefficients during regularisation,
    - but they are far closer to one another than they were before standardisation.
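
For comparison, here is a minimal NumPy sketch of Z-score standardisation on a small made-up array with one outlier. The function name `standardise` is illustrative; it uses the population standard deviation (`ddof=0`).

```python
import numpy as np

def standardise(x):
    """Z-score standardisation: subtract the mean, divide by the standard deviation."""
    return (x - x.mean()) / x.std()   # np.std defaults to ddof=0

values = np.array([10.0, 12.0, 11.0, 13.0, 12.5, 1000.0])   # last value is an outlier
z = standardise(values)

print(np.round(z, 3))
print("mean:", round(float(z.mean()), 6), "std:", round(float(z.std()), 6))
print("contains negative values:", bool((z < 0).any()))
```

The result has mean zero and unit standard deviation, but its values are not confined to a fixed interval.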

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/regression_sprint_data_2.csv', index_col=0)
df.head()

Using monthly data for the Rand/Dollar exchange rate, as well as a few potential predictor variables. 

The goal is to try and model the exchange rate, using the other 19 variables.   

The way we write this is as follows:   

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p$$   

- $Y$ is the response variable which depends on the _p_ predictor variables.

In [None]:
# split data into predictors and response
X = df.drop('ZAR/USD', axis=1)
y = df['ZAR/USD']

In [None]:
# import scaler method from sklearn
from sklearn.preprocessing import StandardScaler

# create scaler object
scaler = StandardScaler()

# create scaled version of the predictors (there is no need to scale the response)
X_scaled = scaler.fit_transform(X)

# convert the scaled predictor values into a dataframe
X_standardise = pd.DataFrame(X_scaled,columns=X.columns)
X_standardise.head()

Taking a look at one of the variables as an example (Value of Exports (USD)), we can see that standardising the data has caused it to be centered around zero.

The variance within each variable in the data is now equal to one.

In [None]:
plt.hist(X_standardise['Value of Exports (USD)'])
plt.show()

In [None]:
X_standardise.describe().loc['std']

# 3.1. Regularisation Methods: Ridge Regression

Understand what regularisation is and how to implement it using the ridge method

Linear regression is a popular choice, but it often faces the challenge of overfitting, especially with a high number of parameters. 

This is where ridge and lasso regression come in, offering practical solutions to 
- enhance model accuracy and 
- make informed decisions in data analysis. 

Regularization techniques are used to address overfitting and enhance model generalizability. 
- Ridge and lasso regression are effective methods in machine learning that introduce **penalties** on the magnitude of regression coefficients. 
    - They work by penalizing the magnitude of the feature coefficients while still minimizing the error between predicted and actual observations. These are called ‘regularization’ techniques.

Ridge and lasso regression are powerful techniques generally used for creating parsimonious (simple) models in the presence of a ‘large’ number of features. 
- ‘Large’ can typically mean either of two things:
    - Large enough to enhance the tendency of a model to overfit (as low as 10 variables might cause overfitting)
    - Large enough to cause computational challenges. With modern systems, this situation might arise in the case of millions or billions of features.

#### Shrinkage Methods

Ridge regression aims to modify and potentially improve the test-set performance of a least squares regression model by reducing the magnitude of some subset of the coefficients $\hat{\beta}$.
- The ridge regression process of reducing the magnitude of those coefficients is a type of _shrinkage_ method: we are attempting to shrink the values of the less important coefficients.
- In ridge regression, a coefficient's value can be shrunk towards zero, but it never reaches exactly zero.

#### Usage of Ridge Regression:
- When the independent variables are highly collinear, ordinary linear or polynomial regression produces unstable coefficient estimates.
    - Ridge regression can be used to solve this problem.
- When we have more parameters than samples, the least squares solution is not unique.
    - Ridge regression still yields a unique solution.
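
The collinearity case can be illustrated with a small simulation. This is a minimal sketch on hypothetical data (the names `x1`, `x2`, `X_col` and `y_col` are made up for the example): two predictors are almost exact copies of each other, and only their shared signal drives the response.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-3, size=n)        # almost an exact copy of x1
X_col = np.column_stack([x1, x2])
y_col = 3 * x1 + rng.normal(scale=0.5, size=n)  # response driven by the shared signal

ols_col = LinearRegression().fit(X_col, y_col)
ridge_col = Ridge(alpha=1.0).fit(X_col, y_col)

print("OLS coefficients:  ", ols_col.coef_)
print("Ridge coefficients:", ridge_col.coef_)
```

With predictors this similar, the OLS coefficients are typically large and of opposite sign, because the data barely distinguishes the two columns. The ridge coefficients split the effect almost evenly between them, and their sum stays close to the true value of 3.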

#### Limitation of Ridge Regression:

Does not help with feature selection: 
- It decreases the complexity of a model but does not reduce the number of independent variables, since it never sets a coefficient to zero; it only shrinks it. 
    - This technique is therefore not suitable for feature selection.

Model Interpretability: 
- It shrinks the coefficients of the least important predictors very close to zero, but it will never make them exactly zero. 
- The final model will include all the independent variables, also known as predictors.

### **Ridge Regression**

Description
- Ridge regression, also known as Tikhonov regularization, 
- is a technique that introduces a penalty term to the linear regression model to shrink the coefficient values.

Penalty Type
- Ridge regression utilizes an L2 penalty, 
    - which adds the sum of the squared coefficient values multiplied by a tuning parameter (lambda).

Coefficient Impact
- The L2 penalty in ridge regression discourages large coefficient values, pushing them towards zero but never exactly reaching zero. This shrinks the less important features’ impact.

Feature Selection
- Ridge regression retains all features in the model, reducing the impact of less important features by shrinking their coefficients.

Use Case
- Ridge regression is useful when the goal is to minimize the impact of less important features while keeping all variables in the model.

Model Complexity
- Ridge regression tends to favor a model with a higher number of parameters, as it shrinks less important coefficients but keeps them in the model.

Interpretability
- The results of ridge regression may be less interpretable due to the inclusion of all features, each with a reduced but non-zero coefficient.

Sparsity
- Ridge regression does not yield sparse models since all coefficients remain non-zero.

Sensitivity
- More robust and less sensitive to outliers compared to lasso regression.

#### Regularisation: The theory behind regularisation.

Manual variable selection is often performed to improve the predictive accuracy of a model.

The process of variable selection is discrete in that we either keep a variable, or we throw it away.   

**Regularisation** offers an alternative method in which all predictor variables are included, but are subject to constraint. 

Recall that the least squares method seeks to minimise the sum of the squares of the residuals:

$$RSS = \sum_{i=1}^n(y_i-\hat{y}_i)^2$$   

which can be written in terms of the predictor variable coefficients, $b_1, b_2, \ldots, b_p$, and the intercept, $a$:   

$$RSS = \sum_{i=1}^n(y_i-(a+\sum_{j=1}^pb_jx_{ij}))^2$$

where 
- _n_ is the number of observations, and 
- _p_ is the number of predictor variables. 

In the case of **ridge regression**, the regression coefficients are calculated as the values that minimise:

$$\sum_{i=1}^n(y_i-(a+\sum_{j=1}^pb_jx_{ij}))^2 + \alpha\sum_{j=1}^pb_j^2$$

which can be written more compactly as:

$$\min_{a,\,b_1,\ldots,b_p} \left(RSS + \alpha\sum_{j=1}^pb_j^2\right)$$
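
The objective can be written as a small function and checked against a fitted model. The sketch below assumes scikit-learn's `Ridge`, which minimises this same quantity with an unpenalised intercept. The helper `ridge_objective` and the simulated data are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_objective(a, b, X, y, alpha):
    """RSS plus alpha times the sum of squared coefficients (intercept a is not penalised)."""
    residuals = y - (a + X @ b)
    return np.sum(residuals ** 2) + alpha * np.sum(b ** 2)

rng = np.random.default_rng(1)
X_obj = rng.normal(size=(40, 3))
y_obj = X_obj @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=40)

alpha = 5.0
model = Ridge(alpha=alpha).fit(X_obj, y_obj)
best = ridge_objective(model.intercept_, model.coef_, X_obj, y_obj, alpha)

# Nudging the fitted coefficients away from the solution should increase the objective
nudged = ridge_objective(model.intercept_, model.coef_ + 0.05, X_obj, y_obj, alpha)
print(best < nudged)
```

Because the objective is strictly convex in the coefficients, any move away from the fitted values raises it, which is what the comparison checks.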

##### **Objective Function of Ridge regression**

In Ridge Regression, the objective function is a modified version of the Ordinary Least Squares (OLS) loss function with an additional L2 regularization term to penalize large coefficients. 

The objective function can be written as:
$$ J(\beta) = \sum^n_{i = 1} (y_i - \hat{y_i})^2 + \lambda \sum^p_{j=1} \beta^2_j $$

Objective = RSS + $\lambda$ * (sum of the squares of the coefficients)
- where:
    - The first term is the Residual Sum of Squares (RSS).
    - The second term is the L2 penalty (the squared L2 norm of the coefficients, scaled by λ).
    - λ is the regularization parameter that controls the shrinkage (written as $\alpha$ in the previous formula and in scikit-learn).

Breakdown of the terms:
- **Residual Sum of Squares (RSS):**
$$\sum^n_{i = 1} (y_i - \hat{y}_i)^2 $$
    - This represents the sum of squared differences between the observed values $y_i$ and the predicted values $\hat{y}_i$.
    - In minimising _RSS_, we improve the overall fit of the model. 

- **L2 Regularization Term (Penalty Term):**
$$ \lambda \sum^p_{j = 1} \beta^2_j $$
    - Ridge regression performs ‘L2 regularization’, i.e., it adds a multiple of the sum of squares of the coefficients to the optimization objective.
    - This term penalizes large coefficient values, which helps prevent overfitting. 
- Calculating the L2 penalty in Ridge Regression
    - The L2 (Euclidean) norm of the coefficient vector is $\sqrt{\sum_j \beta_j^2}$. Ridge regression penalises its square, i.e. the sum of the squared regression coefficients:
$$ \sum^p_{j = 1} \beta^2_j$$
- where:
    - $\beta_j$ are the regression coefficients.
    - p is the number of features.

- **Hyperparameter λ** (also called the tuning parameter, and named `alpha` in scikit-learn) controls the strength of the L2 penalty, which shrinks the regression coefficients.
    - In the newly introduced term, $\lambda\sum_{j=1}^p \beta_j^2$, the intention is to penalise individual coefficients that get too large (typically those doing the most to reduce the training error).
    - $\lambda$ is a tuning parameter (which we choose later on) that controls the degree to which the regression coefficients are penalised. 
        - The effect of this penalty parameter is to create a tradeoff between how much a coefficient contributes to minimising RSS and the size of the coefficient. 
        - In other words: _training fit_ vs. _size of coefficients_. 
    - The penalty parameter is applied to the sum of the squares of the coefficients. 
        - This means that as the coefficients grow, the penalty grows too. 
        - This has the effect of _shrinking_ the coefficients towards zero.
    - $\lambda$ balances the emphasis given to minimizing RSS vs. minimizing the sum of squares of the coefficients. It can take various values:

**$\lambda$ = 0:**
- The objective becomes the same as simple linear regression, so we get the same coefficients as simple linear regression.
$$ \lambda = 0$$
- The Ridge objective function reduces to:
$$ \min_{\beta} \sum_{i=1}^n (y_i - X_i\beta)^2 $$
- which is the standard Ordinary Least Squares (OLS) objective function.
    - Ridge regression reduces to OLS regression (no regularization).
        - The model picks the least-squares estimates of the coefficients.
        - No shrinkage occurs, and multicollinearity can cause large, unstable coefficients.
        - Interpretation:
            - No shrinkage is applied to the coefficients.
            - Ridge regression behaves exactly like OLS.
            - This setting is appropriate only when we do not need to control multicollinearity or overfitting.

**$\lambda$ = ∞:**
- The coefficients will be zero. Why? 
    - Because of the infinite weight on the squares of the coefficients, any coefficient other than zero will make the objective infinite.
    - Ridge regression is equivalent to the null model when the regularization parameter (λ) is infinitely large.
$$ \lambda \rightarrow \infty $$
- The penalty term dominates the objective function,
- forcing all coefficients ($\beta_j$) to approach zero:
$$ \beta_j \rightarrow 0 \quad \text{for all } j$$
- The model simplifies to:
$$\hat{y}_i = \bar{y} $$
- which means the model only predicts the mean of y (ignoring all features).
- λ is Large (Strong Regularization)
    - As λ increases, the penalty on large coefficients increases, leading to smaller regression coefficients.
    - The penalty term is large, so coefficients shrink significantly.
    - The model prefers smaller values of $\beta_j$, which reduces variance.
    - Extreme shrinkage can lead to underfitting (oversimplified model).
    - The model is less sensitive to multicollinearity, leading to more stable coefficients.
    - Interpretation (in the limit $\lambda \rightarrow \infty$):
        - No features contribute to the prediction.
        - The model is equivalent to the mean response (null model).
        - The model has high bias and is completely underfitting.

**0 < $\lambda$ < ∞:**
- The magnitude of $\lambda$ decides the weight given to the two parts of the objective.
- The coefficients will lie somewhere between 0 and the simple linear regression estimates (in overall magnitude).
- λ is Small (Light Regularization)
    - The penalty term is small, meaning only slight shrinkage in coefficients.
    - Coefficients remain close to OLS estimates but slightly reduced.
    - The model still captures most of the variance, with a small bias introduced.
- Any non-zero value of $\lambda$ gives coefficients whose overall magnitude is smaller than that of simple linear regression.

| λ Value | Effect on Coefficients | Model Behavior | What It Becomes |
|---------|------------------------|----------------|-----------------|
| 0 | Same as OLS (no shrinkage) | High variance, risk of overfitting | Least Squares Regression |
| Small | Slight shrinkage | Balanced bias-variance tradeoff | Ridge Regression (Balanced) |
| Moderate | Significant shrinkage | Reduces overfitting, stable | Ridge Regression (Balanced) |
| Large | Strong shrinkage (close to 0 but not exactly 0) | High bias, underfitting, reduced predictive power | Strong Regularization |
| ∞ | All coefficients approach 0 | Model predicts mean of y | Null model |
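
The two extremes of the table can be checked numerically. The sketch below uses synthetic data and stands in for the limits with a very small and a very large `alpha`; the specific values `1e-8` and `1e12` and the `_lim` variable names are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

X_lim, y_lim = make_regression(n_samples=80, n_features=4, noise=5, random_state=0)

ols_lim = LinearRegression().fit(X_lim, y_lim)
tiny = Ridge(alpha=1e-8).fit(X_lim, y_lim)   # lambda close to 0
huge = Ridge(alpha=1e12).fit(X_lim, y_lim)   # lambda very large

# lambda -> 0: the coefficients match the OLS estimates
print(np.allclose(tiny.coef_, ols_lim.coef_, atol=1e-4))

# lambda -> infinity: coefficients are tiny and predictions collapse to the mean of y
print(np.abs(huge.coef_).max())
print(np.allclose(huge.predict(X_lim), y_lim.mean(), atol=1e-4))
```

The intercept is not penalised, which is why the heavily regularised model falls back to predicting the mean response rather than zero.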

**Matrix Form:**
- In matrix notation, the objective function can be rewritten as:
$$ J(\beta) = ||Y-X\beta||^2 + \lambda||\beta||^2$$
- where:
    - Y is the response vector ($n\times 1$).
    - X is the design matrix ($n\times p$).
    - $\beta$ is the coefficient vector ($p\times 1$).
    - $||\beta||^2$ represents the squared L2 norm of the coefficient vector.

**Ridge Regression Closed-form Solution:**

The ridge regression estimator is given by:
$$ \hat{\beta}_{ridge} = (X^T X + \lambda I)^{-1} X^T Y$$
- where I is the identity matrix of size p×p.
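
The closed-form estimator can be computed directly with NumPy and compared with scikit-learn. This is a minimal sketch on simulated data (the `_cf` names and `lam` are hypothetical). It sets `fit_intercept=False` so that both approaches solve the same intercept-free problem as the matrix formula.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(2)
X_cf = rng.normal(size=(60, 4))
y_cf = X_cf @ np.array([1.5, 0.0, -2.0, 0.7]) + rng.normal(scale=0.5, size=60)
lam = 3.0

# beta_ridge = (X^T X + lambda I)^{-1} X^T Y, solved without forming the inverse explicitly
p = X_cf.shape[1]
beta_closed = np.linalg.solve(X_cf.T @ X_cf + lam * np.eye(p), X_cf.T @ y_cf)

# fit_intercept=False so that scikit-learn solves the same problem (no intercept term)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X_cf, y_cf).coef_

print(np.allclose(beta_closed, beta_sklearn))
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a common numerical choice. Adding $\lambda I$ also makes the system solvable when $X^T X$ alone would be singular.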

Key Takeaways:
- Ridge regression shrinks the regression coefficients towards zero but does not force them to be exactly zero (unlike Lasso regression).
- It is useful when dealing with multicollinearity, as it stabilizes the solution by adding bias while reducing variance.
- The choice of λ (regularization strength) is crucial and can be tuned using cross-validation.

##### Compute L2 Norm in Ridge Regression
Let’s fit a Ridge regression model and calculate the squared L2 norm (the sum of squared coefficients) for several values of alpha.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Generate synthetic regression data
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=42)

# Standardize features (important for Ridge)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Fit Ridge Regression Model with different values of λ (alpha)
alpha_values = [0.01, 0.1, 1, 10, 100]
l2_norms = []

for alpha in alpha_values:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    
    # Compute the squared L2 norm (sum of squares of the coefficients)
    l2_norm = np.sum(ridge.coef_ ** 2)
    l2_norms.append(l2_norm)
    
    print(f"Alpha: {alpha}, L2 Norm: {l2_norm:.4f}")

# Plot L2 Norm vs. Alpha
plt.figure(figsize=(8, 5))
plt.plot(alpha_values, l2_norms, marker='o', linestyle='-', color='b')
plt.xscale('log')  # Log scale for alpha
plt.xlabel("Regularization Strength (Alpha)")
plt.ylabel("L2 Norm of Coefficients")
plt.title("Effect of Regularization on L2 Norm in Ridge Regression")
plt.show()


### Bias-Variance Trade-off in Ridge Regression and the Effect of the Tuning Parameter $\lambda$
Ridge regression introduces a penalty term controlled by the tuning parameter α (also called λ in some texts). The choice of α directly influences the bias-variance trade-off.

**Understanding the Bias-Variance Trade-off**
- Bias: The error introduced by approximating a complex problem with a simpler model. High bias means underfitting.
- Variance: The sensitivity of the model to small changes in the training data. High variance means overfitting.

A good model balances bias and variance to minimize total error (test error).

**How Ridge Regression Affects Bias and Variance**
$$ J(\beta) = \sum^n_{i = 1} (y_i - \hat{y_i})^2 + \lambda \sum^p_{j=1} \beta^2_j $$
- Where
    - Least Square loss: $\sum^n_{i = 1} (y_i - \hat{y_i})^2$
    - L2 regularization penalty: $\lambda \sum^p_{j=1} \beta^2_j $
        - Controls the Strength of the penalty: $\lambda$

| Tuning Parameter λ | Bias Effect | Variance Effect |
|--------------------|-------------|-----------------|
| 0 (no regularisation, same as Ordinary Least Squares, OLS) | Low bias (fits data well) | High variance, overfitting to noise |
| Small (weak regularization) | Slight increase in bias | Slight reduction in variance |
| Moderate | Moderate increase in bias | Moderate reduction in variance |
| Large | High bias (underfits data) | Low variance (more stable, but may miss patterns) |
| Approaches ∞ (extreme regularization) | Very high bias (all coefficients shrink to zero) | Very low variance (becomes a constant model) |
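
The pattern in the table can be estimated by simulation. This is a rough sketch under stated assumptions: a hypothetical linear ground truth (`true_beta`), Gaussian noise, and 200 freshly drawn training sets per value of `alpha`. Squared bias and variance are measured on a fixed set of evaluation points.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n_train, n_features, n_repeats = 30, 10, 200
true_beta = rng.normal(size=n_features)
X_eval = rng.normal(size=(100, n_features))  # fixed evaluation points
f_eval = X_eval @ true_beta                  # noise-free targets at those points

def bias_variance(alpha):
    """Estimate squared bias and variance of Ridge predictions over many training sets."""
    preds = np.empty((n_repeats, len(X_eval)))
    for r in range(n_repeats):
        X_tr = rng.normal(size=(n_train, n_features))
        y_tr = X_tr @ true_beta + rng.normal(scale=2.0, size=n_train)
        preds[r] = Ridge(alpha=alpha).fit(X_tr, y_tr).predict(X_eval)
    bias_sq = np.mean((preds.mean(axis=0) - f_eval) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {alpha: bias_variance(alpha) for alpha in [0.01, 1, 10, 100, 1000]}
for alpha, (b2, var) in results.items():
    print(f"alpha={alpha:>7}: bias^2={b2:.3f}, variance={var:.3f}")
```

Across the grid, squared bias should rise and variance should fall as `alpha` grows. The exact numbers depend on the simulated setup.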

**Visualizing the Effect of α on Bias-Variance**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate synthetic data
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3 * np.sin(X).ravel() + np.random.normal(0, 0.5, 100)  # True function + noise

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

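# Try different alpha values (a tiny alpha stands in for OLS, because alpha=0 cannot be drawn on the log-scaled axis used below)
alphas = [0.001, 0.1, 1, 10, 100]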
train_errors = []
test_errors = []

plt.figure(figsize=(10, 5))

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    
    y_train_pred = ridge.predict(X_train)
    y_test_pred = ridge.predict(X_test)
    
    train_errors.append(mean_squared_error(y_train, y_train_pred))
    test_errors.append(mean_squared_error(y_test, y_test_pred))
    
    plt.plot(X_test, y_test_pred, label=f"Ridge (alpha={alpha})")

plt.scatter(X_test, y_test, color='black', label="True Data")
plt.xlabel("X")
plt.ylabel("y")
plt.title("Effect of Ridge Regularization on Model Fit")
plt.legend()
plt.show()

# Plot Bias-Variance Trade-off
plt.figure(figsize=(8, 5))
plt.plot(alphas, train_errors, label="Train Error", marker="o")
plt.plot(alphas, test_errors, label="Test Error", marker="o")
plt.xlabel("Alpha (Regularization Strength)")
plt.ylabel("Mean Squared Error")
plt.title("Bias-Variance Trade-off in Ridge Regression")
plt.xscale("log")
plt.legend()
plt.show()

##### Finding the Optimal 𝛼 Using Cross-Validation in Ridge Regression
To select the best regularization parameter 𝛼, we use cross-validation (CV), which helps identify the α that minimizes test error. A common approach is k-fold cross-validation, where the data is split into k subsets, and the model is trained and validated on different folds iteratively.

Cross-Validation for Ridge Regression

✅ Use RidgeCV from sklearn.linear_model, which automatically performs cross-validation to select the best 𝛼.

✅ Test multiple α values.

✅ Visualize the effect of α on model performance.
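
Before relying on `RidgeCV`, it can help to see what k-fold cross-validation does step by step. The sketch below writes the loop out by hand on synthetic data; the `_kf` names, the alpha grid and `best_alpha_manual` are hypothetical choices for the illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X_kf, y_kf = make_regression(n_samples=120, n_features=8, noise=15, random_state=1)
candidate_alphas = [0.01, 0.1, 1, 10, 100]
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

mean_cv_mse = {}
for alpha in candidate_alphas:
    fold_mse = []
    for train_idx, val_idx in kfold.split(X_kf):
        # Train on k-1 folds, validate on the held-out fold
        model = Ridge(alpha=alpha).fit(X_kf[train_idx], y_kf[train_idx])
        fold_mse.append(mean_squared_error(y_kf[val_idx], model.predict(X_kf[val_idx])))
    mean_cv_mse[alpha] = np.mean(fold_mse)

# The selected alpha is the one with the lowest average validation error
best_alpha_manual = min(mean_cv_mse, key=mean_cv_mse.get)
print(mean_cv_mse)
print("Best alpha:", best_alpha_manual)
```

Every observation is used for validation exactly once, so the averaged error is a less noisy guide to the choice of alpha than a single train/validation split.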

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Generate synthetic data
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3 * np.sin(X).ravel() + np.random.normal(0, 0.5, 100)  # True function + noise

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a range of alpha values
alphas = np.logspace(-3, 3, 50)  # Testing alpha values from 0.001 to 1000

# Perform Ridge regression with 10-fold cross-validation.
# store_cv_values=True cannot be combined with an explicit cv, so the
# per-alpha CV errors are recomputed further down with cross_val_score.
ridge_cv = RidgeCV(alphas=alphas, scoring="neg_mean_squared_error", cv=10)
ridge_cv.fit(X_train, y_train)

# Get the best alpha
best_alpha = ridge_cv.alpha_
print(f"Optimal Alpha from Cross-Validation: {best_alpha:.4f}")

# Evaluate performance on training and test sets
y_train_pred = ridge_cv.predict(X_train)
y_test_pred = ridge_cv.predict(X_test)

train_error = mean_squared_error(y_train, y_train_pred)
test_error = mean_squared_error(y_test, y_test_pred)

print(f"Train MSE: {train_error:.4f}")
print(f"Test MSE: {test_error:.4f}")

# Plot Cross-Validation Error vs Alpha
# Average 10-fold CV error (MSE) for each alpha
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

cv_errors = [
    -cross_val_score(Ridge(alpha=a), X_train, y_train, cv=10,
                     scoring="neg_mean_squared_error").mean()
    for a in alphas
]

plt.figure(figsize=(8, 5))
plt.plot(alphas, cv_errors, marker="o", label="Cross-Validation Error")
plt.axvline(best_alpha, linestyle="--", color="red", label=f"Best Alpha: {best_alpha:.4f}")
plt.xscale("log")
plt.xlabel("Alpha (Regularization Strength)")
plt.ylabel("Mean Squared Error (CV Error)")
plt.title("Cross-Validation Error vs Regularization Strength")
plt.legend()
plt.show()


### Ridge Regression
Split our data into a training and a testing set.
- Use the first eight years of data as our training set and 
- test the model on the final two years. 

Note that with time-series data it isn't appropriate to sample rows randomly for the training and testing sets because **chronological order** remains important.
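
A chronological split can be sketched as follows. The array `months` is a hypothetical stand-in for 120 monthly observations in date order. `TimeSeriesSplit` is shown as one order-preserving option for cross-validation; it is not the approach used in the cells below.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical stand-in for 120 monthly observations in chronological order
months = np.arange(120).reshape(-1, 1)

# Simple holdout: the first 80% for training, the final 20% for testing
cut = int(len(months) * 0.8)
train_months, test_months = months[:cut], months[cut:]
print(train_months.max() < test_months.min())  # no test month precedes a training month

# Cross-validation that keeps the order: each validation fold lies after its training data
tscv = TimeSeriesSplit(n_splits=4)
splits = list(tscv.split(months))
for train_idx, val_idx in splits:
    print(f"train 0..{train_idx.max()}, validate {val_idx.min()}..{val_idx.max()}")
```

In every split the model is validated only on observations that come after the ones it was trained on, which mirrors how a forecast would be used in practice.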

Fit and test our model.
- Create a `Ridge()` object without modifying any of the parameters. 
    - This means that we will use the default value of $\alpha=1$. 
    
We'll learn about choosing a better value for this hyperparameter.

In [None]:
# Import the libraries used in this section
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Option 1: synthetic regression data
X, y = make_regression(n_samples=100, n_features=3, noise=20, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (fit the scaler on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# OR
# Option 2 (used in the rest of this section): the exchange-rate data.
# Earlier cells reused the names X and y for synthetic data, so rebuild them from df first.
X = df.drop('ZAR/USD', axis=1)
y = df['ZAR/USD']

# Split the data into train and test, being sure to use the standardised predictors.
# shuffle=False keeps the chronological order: the first 80% is training, the final 20% is testing.
X_train, X_test, y_train, y_test = train_test_split(X_standardise, 
                                                    y, 
                                                    test_size=0.2, 
                                                    shuffle=False)

In [None]:
# Ordinary Least Squares (OLS) Regression
ols = LinearRegression()
ols.fit(X_train, y_train)

In [None]:
# Create ridge model with the default penalty
ridge = Ridge()
# Equivalent model with the penalty written out explicitly (lambda = alpha = 1.0); not used below
ridge2 = Ridge(alpha=1.0)

# Train the model
ridge.fit(X_train, y_train)

In [None]:
# Step 4: Compare Model Performance
# Predictions
y_pred_ols = ols.predict(X_test)
y_pred_ridge = ridge.predict(X_test)

# Compute Mean Squared Error (MSE)
mse_ols = mean_squared_error(y_test, y_pred_ols)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)

# Print results
print(f"OLS MSE: {mse_ols:.4f}")
print(f"Ridge MSE: {mse_ridge:.4f}")


# OLS coefficients, raw and as a labelled table
print("\nOLS Coefficients:", ols.coef_)
coeff_lr = pd.DataFrame(ols.coef_, X.columns, columns=['Coefficient'])
print(coeff_lr)

# OLS intercept
b0_lr = float(ols.intercept_)
print("Intercept:", b0_lr)

# Ridge coefficients, raw and as a labelled table
print("\nRidge Coefficients:", ridge.coef_)
coeff = pd.DataFrame(ridge.coef_, X.columns, columns=['Coefficient'])
print(coeff)

# Ridge intercept
b0 = float(ridge.intercept_)
print("Intercept:", b0)

In [None]:
# Step 5: Visualizing Coefficients
# Plot coefficient shrinkage
plt.figure(figsize=(8,5))
plt.plot(range(len(ols.coef_)), ols.coef_, marker='o', label="OLS Coefficients")
plt.plot(range(len(ridge.coef_)), ridge.coef_, marker='s', label="Ridge Coefficients")
plt.axhline(y=0, color='gray', linestyle='--')
plt.xlabel("Feature Index")
plt.ylabel("Coefficient Value")
plt.legend()
plt.title("Coefficient Shrinkage in Ridge Regression")
plt.show()


##### Interpretation of the intercept and coefficients

Since we standardised the features,
- we can compare coefficients to each other, because
- the respective variables are all on the same scale.
- We interpret the intercept as the expected exchange rate when all the features are equal to their respective means, and each coefficient as the expected change in the exchange rate given an increase of 1 in the **scaled feature value** (that is, one standard deviation of the original feature).

We can interpret variables with smaller coefficients as less important, as they have suffered more in the shrinkage tradeoff.
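
If coefficients are needed per original unit rather than per standard deviation, they can be converted back using the scaler's stored means and scales. The sketch below uses hypothetical data (`X_raw`, `y_raw`) rather than the exchange-rate set, and assumes a `StandardScaler` fitted on the same rows as the model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Hypothetical predictors on very different scales
X_raw = np.column_stack([rng.normal(50, 10, size=100), rng.normal(0.5, 0.1, size=100)])
y_raw = 0.2 * X_raw[:, 0] + 30 * X_raw[:, 1] + rng.normal(scale=1.0, size=100)

scaler_demo = StandardScaler().fit(X_raw)
X_std = scaler_demo.transform(X_raw)
model_demo = Ridge(alpha=1.0).fit(X_std, y_raw)

# Coefficient per one standard deviation -> coefficient per original unit
coef_original = model_demo.coef_ / scaler_demo.scale_
intercept_original = model_demo.intercept_ - np.sum(coef_original * scaler_demo.mean_)

# With standardised features, the intercept equals the mean response
print(np.isclose(model_demo.intercept_, y_raw.mean()))
# Both parameterisations give the same predictions
print(np.allclose(X_raw @ coef_original + intercept_original, model_demo.predict(X_std)))
```

The scaled coefficients remain the ones to compare across variables. The converted ones are only a convenience for reporting effects in original units.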

##### Key Observations:
Mean Squared Error (MSE):
- Ridge regression often has a lower MSE than OLS in cases of multicollinearity or small datasets.
- If the dataset is large and well-conditioned, OLS might perform similarly.

Coefficient Shrinkage:
- Ridge regression reduces the magnitude of the coefficients compared to OLS.
- This prevents overfitting and improves generalization.

Effect of Regularization (Lambda / Alpha):
- Increasing alpha shrinks coefficients more, making the model more biased but reducing variance.

In [None]:
# Fit a basic linear model
from sklearn.linear_model import LinearRegression

# Create model object
lm = LinearRegression()

# Train model
lm.fit(X_train, y_train)

In [None]:
# Import metrics module
from sklearn import metrics

# Check training accuracy
train_lm = lm.predict(X_train)
train_ridge = ridge.predict(X_train)

print('Training MSE')
print('Linear:', metrics.mean_squared_error(y_train, train_lm))
print('Ridge :', metrics.mean_squared_error(y_train, train_ridge))

In [None]:
test_lm = lm.predict(X_test)
test_ridge = ridge.predict(X_test)

print('Testing MSE')
print('Linear:', metrics.mean_squared_error(y_test, test_lm))
print('Ridge :', metrics.mean_squared_error(y_test, test_ridge))

Ridge regression achieves a much lower MSE on the testing set at the expense of a slightly higher MSE on the training set.
 
The increase in training MSE is nothing to worry about, since we want to avoid overfitting the training set.

In [None]:
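# We want to plot the train and test response variables as a continuous line,
# so extend the training series with the first test observation.
# (Series.append was removed in pandas 2.0, so pd.concat is used instead.)
train_plot = pd.concat([y_train, y_test.iloc[:1]])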

In [None]:
plt.plot(np.arange(len(y)), ridge.predict(X_standardise), label='Predicted')
plt.plot(np.arange(len(train_plot)), train_plot, label='Training')
plt.plot(np.arange(len(y_test))+len(y_train), y_test, label='Testing')
plt.legend()

plt.show()

##### Hyperparameter Tuning for Ridge Regression using Grid Search
To find the best alpha (regularization strength), we will use Grid Search with Cross-Validation.

In [None]:
# Step 1: Import Required Libraries
from sklearn.model_selection import GridSearchCV

# Step 2: Define the Grid Search for Alpha
# Define a range of alpha values
alpha_values = np.logspace(-3, 3, 50)  # 50 values between 0.001 and 1000

# Define the Ridge regression model
ridge = Ridge()

# Define Grid Search with Cross-Validation (5-fold)
ridge_cv = GridSearchCV(ridge, param_grid={'alpha': alpha_values}, scoring='neg_mean_squared_error', cv=5)

# Fit the model
ridge_cv.fit(X_train, y_train)

# Best alpha value
best_alpha = ridge_cv.best_params_['alpha']
print(f"Best alpha: {best_alpha}")

# Step 3: Train Ridge Regression with Best Alpha
# Train Ridge Regression with optimal alpha
ridge_best = Ridge(alpha=best_alpha)
ridge_best.fit(X_train, y_train)

# Predict on test data
y_pred_ridge_best = ridge_best.predict(X_test)

# Compute MSE
mse_ridge_best = mean_squared_error(y_test, y_pred_ridge_best)
print(f"Ridge Regression MSE (Best Alpha): {mse_ridge_best:.4f}")

# Step 4: Visualizing Alpha vs. MSE
# Extract mean test scores (negative MSE)
mse_scores = -ridge_cv.cv_results_['mean_test_score']

plt.figure(figsize=(8,5))
plt.plot(alpha_values, mse_scores, marker='o', linestyle='-', label="Validation MSE")
plt.xscale('log')  # Log scale for alpha
plt.xlabel("Alpha (λ)")
plt.ylabel("Mean Squared Error (MSE)")
plt.title("Ridge Regression: MSE vs. Regularization Strength")
plt.legend()
plt.show()

##### Key Insights from Grid Search Tuning
Optimal Alpha Selection
- The best_alpha found minimizes validation error.
- A smaller alpha (closer to 0) means Ridge behaves like OLS.
- A larger alpha (closer to 1000) increases regularization, making coefficients smaller.

Performance Improvement
- The tuned Ridge regression model likely outperforms a model with an arbitrarily chosen alpha.
- The trade-off between bias and variance is better handled.

#### Ridge regression in Sine / polynomial problem as below under GLMs

Function for Ridge Regression
- It takes ‘alpha’ as a parameter on initialization.

Remember that normalizing the inputs generally benefits every type of regression, and it should be applied before ridge regression as well.

In [None]:
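import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def ridge_regression(data, predictors, alpha, models_to_plot=None):
    # Avoid a mutable default argument
    models_to_plot = models_to_plot or {}

    # Fit the model. Ridge(normalize=True) was removed in scikit-learn 1.2; standardising
    # in a pipeline and multiplying alpha by the number of samples should give the same fit.
    pipeline = make_pipeline(StandardScaler(), Ridge(alpha=alpha * len(data)))
    pipeline.fit(data[predictors], data['y'])
    y_pred = pipeline.predict(data[predictors])
    
    #Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for alpha: %.3g'%alpha)
    
    # Convert the coefficients and intercept back to the original feature scale
    scaler = pipeline.named_steps['standardscaler']
    ridgereg = pipeline.named_steps['ridge']
    coef = ridgereg.coef_ / scaler.scale_
    intercept = ridgereg.intercept_ - np.sum(coef * scaler.mean_)

    #Return the result in pre-defined format: [rss, intercept, coefficients...]
    rss = sum((y_pred-data['y'])**2)
    ret = [rss, intercept]
    ret.extend(coef)
    return ret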

Analyze the result of Ridge regression for 10 different values of α ranging from 1e-15 to 20. 

These values have been chosen so that we can easily analyze the trend with changes in values of $\alpha$.

These 10 models will contain all the 15 variables, and only the value of alpha would differ. 
- This differs from the simple linear regression case, where each model had a subset of features.

In [None]:
#Initialize predictors to be set of 15 powers of x
predictors=['x']
predictors.extend(['x_%d'%i for i in range(2,16)])

#Set the different values of alpha to be tested
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]

#Initialize the dataframe for storing coefficients.
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_ridge[i] for i in range(0,10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)

models_to_plot = {1e-15:231, 1e-10:232, 1e-4:233, 1e-3:234, 1e-2:235, 5:236}
for i in range(10):
    coef_matrix_ridge.iloc[i,] = ridge_regression(data, predictors, alpha_ridge[i], models_to_plot)

Observation: 
- As the value of alpha increases, the model complexity reduces. 
    - Though higher values of alpha reduce overfitting, significantly high values can cause underfitting as well (e.g., alpha = 5). 
        - Thus alpha should be chosen wisely. 
- A widely accepted technique is **cross-validation**, i.e., the value of alpha is iterated over a range of values, and the one giving the highest cross-validation score is chosen.

In [None]:
#Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_ridge

inferences:

- The RSS increases with an increase in alpha.
- An alpha value as small as 1e-15 gives us a significant reduction in the magnitude of coefficients. 
    - How? Compare the coefficients in the first row of this table to the last row of the simple linear regression table.
- High alpha values can lead to significant underfitting. Note the rapid increase in RSS for values of alpha greater than 1
    - Though the coefficients are really small, they are NOT zero.

Reconfirm the same by determining the number of zeros in each row of the coefficients data set:

This should confirm that all 15 coefficients are greater than zero in magnitude (can be +ve or -ve).

In [None]:
coef_matrix_ridge.apply(lambda x: sum(x.values==0),axis=1)

### Effect of Multiplying a Predictor by a Factor in Ridge Regression
When a predictor (feature) in Ridge Regression is multiplied by a constant factor, it affects both the coefficient estimates and the regularization penalty. 
- However, Ridge regression behaves differently from Ordinary Least Squares (OLS) because of the L2 penalty.

**Key Effects of Multiplying a Predictor by a Constant c**
- If a predictor $x_j$ is scaled by a factor c (i.e., $x_j^* = c \cdot x_j$), then:

Coefficient Rescaling:
- To keep the prediction unchanged, the coefficient $\beta_j$ must scale inversely with c, meaning:
$$ \beta_j^* = \frac{\beta_j}{c}$$
- This keeps the prediction unchanged:
$$ x_j^* \cdot \beta_j^* = (c \cdot x_j) \cdot \left(\frac{\beta_j}{c}\right) = x_j \cdot \beta_j$$
- In OLS this rescaling is exact. In Ridge regression it holds only approximately, because the penalty depends on the size of the coefficients, so the amount of shrinkage changes with the scale of the predictor.

Effect on the L2 Penalty:
- The Ridge penalty term is:
$$ \lambda \sum \beta_j^2$$
- When $x_j$ is multiplied by c, the coefficient needed for the same fit becomes $\beta_j / c$, so its contribution to the penalty changes:
$$ \lambda (\frac{\beta_j}{c})^2 = \lambda \frac{\beta_j^2}{c^2}$$
- This means that predictors measured on a larger scale (c > 1) contribute less to the penalty and are effectively penalized less.

Impact on Regularization:
- Features on different scales experience unequal shrinkage.
- Larger-magnitude predictors need smaller coefficients, so they are penalized less (less shrinkage).
- Smaller-magnitude predictors need larger coefficients, so they are penalized more (more shrinkage).
- This can distort the importance of variables if they are on very different scales.
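
The effect of the factor c can be worked out exactly in the simplest case. As a sketch, assume a single predictor and no intercept (a simplification made only for this derivation), so that Ridge minimizes $\sum_i (y_i - \beta x_i)^2 + \lambda \beta^2$. Setting the derivative to zero gives:

$$ \hat{\beta} = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \lambda} $$

Replacing $x_i$ by $x_i^* = c \cdot x_i$ gives:

$$ \hat{\beta}^* = \frac{c \sum_i x_i y_i}{c^2 \sum_i x_i^2 + \lambda} = \frac{1}{c} \cdot \frac{\sum_i x_i y_i}{\sum_i x_i^2 + \lambda / c^2} $$

So $\hat{\beta}^* = \hat{\beta} / c$ only when $\lambda = 0$ (the OLS case). For $\lambda > 0$, rescaling the predictor by c is equivalent to fitting the original predictor with penalty $\lambda / c^2$ and then dividing the coefficient by c.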

Why Standardization is Important in Ridge Regression:
- To ensure that all predictors are penalized equally, Ridge regression works best when features are standardized (mean = 0, variance = 1).
- Standardization rescales each feature, so they contribute equally to the penalty term.

Demonstrate this by fitting Ridge regression on:
- Original data
- Data with one predictor multiplied by 10

Key Observations from the Output
- On unstandardized data, the coefficients adjust roughly inversely to the scaling factor (e.g., if $x_1$ is multiplied by 10, $\beta_1$ is divided by approximately 10).
- The predictions stay close but are not identical, because the Ridge penalty applies differently to different-scaled variables.
- When we standardize the features, both cases yield the same coefficients, ensuring fair regularization.

Conclusion
- Multiplying a predictor by a factor scales its coefficient approximately inversely (exactly so only in OLS).
- Unequal scaling can lead to unfair regularization across features.
- Standardization is essential before Ridge regression to ensure equal treatment of all predictors.

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression

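# Generate synthetic regression data
X, y = make_regression(n_samples=100, n_features=2, noise=10, random_state=42)

# Multiply the first predictor by 10
X_scaled = X.copy()
X_scaled[:, 0] *= 10  # Multiply first feature by 10

# Fit Ridge regression on the raw (unstandardized) data, same alpha
ridge_raw = Ridge(alpha=1).fit(X, y)
ridge_raw_scaled = Ridge(alpha=1).fit(X_scaled, y)

# Standardize each version separately; this removes the factor of 10
X_std = StandardScaler().fit_transform(X)  # Standardized original data
X_scaled_std = StandardScaler().fit_transform(X_scaled)  # Standardized scaled data

# Fit Ridge regression on the standardized data, same alpha
ridge_original = Ridge(alpha=1).fit(X_std, y)
ridge_scaled = Ridge(alpha=1).fit(X_scaled_std, y)

# Print coefficients
print("Raw Ridge Coefficients:", ridge_raw.coef_)
print("Raw Ridge Coefficients (x1 * 10):", ridge_raw_scaled.coef_)
print("Standardized Ridge Coefficients:", ridge_original.coef_)
print("Standardized Ridge Coefficients (x1 * 10):", ridge_scaled.coef_)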

#### Example: Financial Data
Apply Ridge Regression with scaling effects to a real-world financial dataset—predicting stock returns using multiple financial indicators.

Financial Context:
- We will analyze how scaling one predictor (e.g., trading volume) affects Ridge regression when predicting stock returns.
    - Predictors: Historical price changes, volume, volatility, moving averages, etc.
    - Target Variable: Stock daily return (percentage change in closing price).

- Dataset which contains monthly data for the Rand/Dollar exchange rate, as well as a few potential predictor variables.
    - the goal is to try and model the exchange rate, using the other 19 variables.

The way we write this is as follows:   
$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_p X_p $$   
- where
    - $Y$ is the response variable, which depends on the _p_ predictor variables.

Expected Insights

Coefficient Scaling Effect:
- Without standardization, multiplying Volume by 10 divides its coefficient by roughly 10.
- Other feature coefficients remain almost unchanged.
- The model predictions remain nearly the same despite different coefficient values.
- After standardization, as in the code for this example, the factor of 10 is removed and both fits give the same coefficients.

Regularization Impact:
- Unequal scaling affects how Ridge applies penalties to features.
- The penalty depends on coefficient size, so features measured in small units (which need large coefficients) are penalized more heavily.

Why Standardization Matters in Finance:
- Stock market features (e.g., price, volume, volatility) have very different scales.
- Without standardization, Ridge regression may improperly shrink some features more than others.

Takeaways for Financial Modeling

✅ Always standardize financial data before Ridge regression to ensure fair regularization across features.

✅ Scaling a feature affects its coefficient but not the model's predictions.

✅ In asset management, standardization prevents misleading importance weights in multi-factor models.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import yfinance as yf

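# Fetch historical stock data (Apple Inc.)
ticker = "AAPL"
# auto_adjust=False keeps the "Adj Close" column used below
df = yf.download(ticker, period="2y", interval="1d", auto_adjust=False)
# Some yfinance versions return MultiIndex columns even for one ticker; flatten them
if isinstance(df.columns, pd.MultiIndex):
    df.columns = df.columns.get_level_values(0)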

# Feature Engineering: Creating financial predictors
df["Return"] = df["Adj Close"].pct_change() * 100  # Daily return in %
df["Volatility"] = df["Return"].rolling(10).std()  # 10-day rolling volatility
df["Momentum"] = df["Adj Close"] / df["Adj Close"].shift(10) - 1  # 10-day momentum
df["Volume"] = df["Volume"] / 1e6  # Scale Volume to millions
df["SMA_10"] = df["Adj Close"].rolling(10).mean()  # 10-day Simple Moving Average
df = df.dropna()

# Define predictors and target
features = ["Volatility", "Momentum", "Volume", "SMA_10"]
X = df[features]
y = df["Return"]

# Multiply 'Volume' by 10 to test scaling effects
X_scaled = X.copy()
X_scaled["Volume"] *= 10

# Standardize both versions of X
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
X_scaled_std = scaler.fit_transform(X_scaled)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.2, random_state=42)
X_scaled_train, X_scaled_test, _, _ = train_test_split(X_scaled_std, y, test_size=0.2, random_state=42)

# Train Ridge Regression models
ridge_original = Ridge(alpha=1).fit(X_train, y_train)
ridge_scaled = Ridge(alpha=1).fit(X_scaled_train, y_train)

# Print Coefficients
coef_df = pd.DataFrame({
    "Feature": features,
    "Original Coefficients": ridge_original.coef_,
    "Scaled Coefficients": ridge_scaled.coef_
})

print(coef_df)

# Plot the effect of scaling
plt.figure(figsize=(8,5))
plt.bar(coef_df["Feature"], coef_df["Original Coefficients"], alpha=0.6, label="Original")
plt.bar(coef_df["Feature"], coef_df["Scaled Coefficients"], alpha=0.6, label="Scaled")
plt.xlabel("Feature")
plt.ylabel("Coefficient Value")
plt.title("Effect of Scaling 'Volume' on Ridge Regression Coefficients")
plt.legend()
plt.show()


### When to Use Ridge Regression Instead of Least-Squares Regression
Least-squares regression (OLS) is the standard method for fitting a linear model, but it has significant limitations in certain situations. 

Ridge regression is preferred when these limitations arise.

Ridge regression is better when:
- Multicollinearity exists (e.g., correlated financial variables).
- The number of predictors exceeds the number of observations (e.g., high-dimensional datasets).
- Data has high variance/noise, leading to overfitting in OLS.

##### When Least-Squares Regression Fails & Ridge Regression Excels

| Scenario | Why OLS Fails?  | Why Ridge Regression Works?  | 
|---------------|--------|----------------------------------|   
|High Multicollinearity (Highly Correlated Predictors)	|OLS produces unstable and large coefficients, leading to overfitting. |Ridge shrinks coefficients and reduces multicollinearity effects| 
|More Predictors than Observations (High-Dimensional Data, p>n)|OLS cannot uniquely estimate coefficients when p>n (underdetermined system). |Ridge introduces regularization, making estimation possible. | 
|Noisy Data with Many Insignificant Predictors  |OLS fits noise and suffers from high variance (overfitting).|Ridge shrinks less useful coefficients towards zero, improving generalization.| 
|Predictors Have Large Variations in Scale |OLS coefficient sizes simply reflect each predictor's units, which makes them hard to compare. |With standardized predictors, Ridge penalizes all coefficients on a common scale, giving more balanced estimates. |
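
The p>n row of the table can be checked numerically. The sketch below uses random data with hypothetical dimensions (n = 10, p = 20) and no intercept, to match the closed-form ridge solution $(X^TX + \lambda I)^{-1}X^Ty$; it is an illustration, not part of the notebook's analysis.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 10, 20                      # more predictors than observations
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# X'X is p x p but has rank at most n, so it is singular and the OLS
# normal equations have no unique solution
gram = X.T @ X
gram_rank = np.linalg.matrix_rank(gram)
print("rank of X'X:", gram_rank, "out of", p)

# Adding lam * I makes the matrix positive definite, hence invertible
lam = 1.0
beta_ridge = np.linalg.solve(gram + lam * np.eye(p), X.T @ y)

# Cross-check against scikit-learn (no intercept, to match the formula)
sk_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print("max abs difference:", np.max(np.abs(beta_ridge - sk_ridge.coef_)))
```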

Ridge Regression Fixing Multicollinearity
- Consider a financial dataset where marketing spend and advertising spend are highly correlated. OLS will produce unreliable, large coefficients, but Ridge will stabilize them.

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Simulated dataset with multicollinearity
np.random.seed(42)
n = 100
X1 = np.random.rand(n) * 10  # Marketing spend
X2 = X1 + np.random.normal(0, 0.5, n)  # Advertising spend (highly correlated with X1)
X3 = np.random.rand(n) * 5  # Independent feature
y = 3 * X1 + 2 * X2 + 1.5 * X3 + np.random.normal(0, 2, n)  # Target variable

# Create DataFrame
df = pd.DataFrame({'Marketing Spend': X1, 'Advertising Spend': X2, 'Other Feature': X3, 'Revenue': y})

# Train-test split
X = df[['Marketing Spend', 'Advertising Spend', 'Other Feature']]
y = df['Revenue']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# OLS Regression
ols = LinearRegression()
ols.fit(X_train, y_train)

# Ridge Regression
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)

# Print coefficients
print("OLS Coefficients:", ols.coef_)
print("Ridge Coefficients:", ridge.coef_)

### Advantages of Ridge Regression Over Best Subset Selection
Both ridge regression and best subset selection are used for handling multiple predictors in regression models. However, ridge regression is often preferred because of its computational efficiency and ability to handle high-dimensional data.

Key Differences Between Ridge Regression and Best Subset Selection

| Criteria | Ridge Regression  | Best Subset Selection  | 
|---------------|--------|----------------------------------|   
|Handling Multicollinearity	|Shrinks correlated features, making coefficients more stable. |Cannot handle multicollinearity well, as it selects individual variables.| 
|Computational Efficiency|Fast and scalable for large datasets. |Slow: must fit up to $2^p$ candidate models, which is infeasible for large p. | 
|Feature Selection |Keeps all predictors but shrinks less important ones.|Selects a subset of predictors, dropping some entirely.| 
|Overfitting Prevention | Reduces variance and improves generalization. | Can overfit if the best subset is not chosen carefully. |
|Applicability in High Dimensions (p>n)  | Works well even with more predictors than observations. | Fails when p>n as there are not enough observations to fit the model. |
|Interpretability | Harder to interpret since all predictors contribute. | Easier to interpret as some coefficients are set to zero. |

##### Key Advantages of Ridge Over Best Subset Selection
1️⃣ Ridge Handles Multicollinearity Better
- Best subset selection chooses individual variables, which can lead to unstable models when predictors are correlated.
- Ridge shrinks coefficients without removing variables, reducing variance while preserving information.

2️⃣ Ridge Is Computationally Efficient
- Best subset selection requires testing all possible subsets; their number, $2^p$, grows exponentially, making it infeasible for large p.
- Ridge regression solves an optimization problem efficiently using closed-form solutions or gradient-based methods.

3️⃣ Ridge Works Well When 𝑝>𝑛
- Best subset selection cannot work when the number of predictors exceeds the number of observations.
- Ridge can handle high-dimensional data by shrinking coefficients instead of eliminating variables.

4️⃣ Ridge Improves Generalization
- Best subset selection picks the best combination based on training data, which may not generalize well.
- Ridge reduces overfitting by introducing regularization, improving test performance.
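
To make the computational point concrete, the number of candidate models that an exhaustive search has to fit can be counted directly. The helper name `n_subsets` is hypothetical and introduced only for this sketch.

```python
from math import comb

def n_subsets(p: int) -> int:
    """Number of non-empty predictor subsets that best subset selection must fit."""
    return sum(comb(p, k) for k in range(1, p + 1))  # equals 2**p - 1

for p in (5, 10, 20, 40):
    print(f"p = {p:>2}: {n_subsets(p):,} candidate models")
```

With five predictors this is only 31 models, but the count doubles with every predictor added, whereas ridge regression fits a single model per value of alpha.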

##### When to Use Ridge vs. Best Subset Selection?
✅ Use Ridge Regression When:
- The dataset has many correlated predictors (e.g., financial indicators).
- There are more predictors than observations  (p>n).
- You want better predictive accuracy rather than interpretability.

✅ Use Best Subset Selection When:
- You have a small number of predictors and want an interpretable model.
- Multicollinearity is not a major issue.
- You prioritize feature selection over prediction accuracy.

##### Comparison of Ridge Regression and Best Subset Selection in Python
We will use a simulated dataset with multiple correlated predictors to compare Ridge Regression and Best Subset Selection (Exhaustive Feature Selection).

Key Takeaways
- Ridge Regression retains all predictors but shrinks their coefficients to reduce overfitting.
- Best Subset Selection eliminates some features (e.g., X2, X4, X5), keeping only the most important ones.
- RMSE is lower for Ridge Regression than for Best Subset Selection, showing that Ridge generalizes better.

✅ Ridge is better for prediction when multicollinearity exists.

✅ Best Subset is better for feature selection and interpretability.

In [None]:
# Step 1: Install Required Libraries
# First, install mlxtend for Best Subset Selection:
!pip install mlxtend

In [None]:
# Step 2: Import Necessary Libraries
import numpy as np
import pandas as pd
import itertools
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS


# Step 3:  Generate Simulated Data
# Set seed for reproducibility
np.random.seed(42)

# Simulating correlated predictors
n = 100
X1 = np.random.rand(n) * 10  
X2 = X1 + np.random.normal(0, 0.5, n)  # Highly correlated with X1
X3 = np.random.rand(n) * 5  
X4 = np.random.rand(n) * 2  
X5 = np.random.rand(n) * 7  

# Target variable (dependent on X1, X2, X3)
y = 3 * X1 + 2 * X2 + 1.5 * X3 + np.random.normal(0, 2, n)

# Create DataFrame
df = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'X4': X4, 'X5': X5, 'y': y})

# Train-test split
X = df[['X1', 'X2', 'X3', 'X4', 'X5']]
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Step 4: Ridge Regression
# Ridge regression with alpha = 1
ridge = Ridge(alpha=1)
ridge.fit(X_train, y_train)

# Predictions and RMSE
y_pred_ridge = ridge.predict(X_test)
ridge_rmse = np.sqrt(mean_squared_error(y_test, y_pred_ridge))

# Print coefficients and RMSE
print("Ridge Regression Coefficients:", ridge.coef_)
print("Ridge Regression RMSE:", ridge_rmse)

# Step 5: Best Subset Selection
# Perform Best Subset Selection
efs = EFS(LinearRegression(), 
          min_features=1, 
          max_features=5,  # Consider all predictors
          scoring='neg_mean_squared_error', 
          cv=5)

efs.fit(X_train, y_train)

# Get best subset of features
best_features = list(efs.best_idx_)
best_feature_names = X.columns[list(efs.best_idx_)]

# Fit model with selected features
X_train_best = X_train.iloc[:, best_features]
X_test_best = X_test.iloc[:, best_features]

best_model = LinearRegression()
best_model.fit(X_train_best, y_train)

# Predictions and RMSE
y_pred_best_subset = best_model.predict(X_test_best)
best_subset_rmse = np.sqrt(mean_squared_error(y_test, y_pred_best_subset))

# Print selected features and RMSE
print("Best Subset Selected Features:", best_feature_names)
print("Best Subset Selection RMSE:", best_subset_rmse)


### Review of Data Scaling / Standardising a Predictor

Data scaling is essential in regularisation as regularising penalizes a model for large coefficients. 

The magnitude of coefficients is dependent on the following:

* The strength of the relationship between the predictor variables (`x`) and the output variable (`y`)
* The units of measurement of `x` (e.g., distance measured in millimetres or metres).

For example, if `x` is measured in metres and its coefficient is 5, then expressing `x` in kilometres changes the coefficient to 5 × 10³.

We want regularisation to respond to the strength of the relationship between the `x` and `y` variables, not to the units in which `x` happens to be measured.
- Thus, to eliminate the impact of the units of measurement on the coefficients, we scale the data so that all variables are on a comparable footing.

Standardizing the Predictors in Ridge Regression
- Standardization is a crucial preprocessing step in Ridge Regression, especially when predictors (features) have different scales.

What is Standardization?
- Standardization transforms each predictor so that it has:
    - Mean = 0
    - Standard deviation = 1

- The formula for standardizing a predictor $x_j$ is:
$$x_j^{(std)} = \frac{x_j - \bar{x_j}}{s_j}$$
- where:
    - $\bar{x_j}$ = Mean of predictor $x_j$
    - $s_j$ = Standard deviation of predictor $x_j$
- This ensures that all features contribute equally to Ridge regression’s penalty term.
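
The formula can be applied by hand and compared with `StandardScaler`. The sketch below uses two made-up features on very different scales (an assumption for illustration); note that `StandardScaler` divides by the population standard deviation, which corresponds to `ddof=0` in NumPy.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two hypothetical features on very different scales
X = np.column_stack([rng.normal(50, 10, 200), rng.normal(0.02, 0.005, 200)])

# Manual z-score; ddof=0 (population std) matches StandardScaler
X_manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
X_sklearn = StandardScaler().fit_transform(X)

print("Manual and StandardScaler results agree:", np.allclose(X_manual, X_sklearn))
```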

Why Standardization is Important in Ridge Regression
- Equalizes Feature Influence: Features with large values (e.g., trading volume in millions) don’t dominate over small-valued features (e.g., daily return in %).
- Ensures Proper Regularization: Ridge regression penalizes larger coefficients more. If one feature has large values, its coefficient is small and it receives a smaller penalty, distorting regularization.
- Improves Model Convergence: Gradient-based solvers used in Ridge regression work better with standardized data.

Effect of Standardization on Ridge Regression

Before standardization:
- Different features contribute unequally to Ridge’s L2 penalty.
- Some features dominate simply due to their scale.

After standardization:
- All features contribute fairly to the model.
- Ridge regression applies its penalty on a common scale across all predictors.

Key Takeaways

✅ Always standardize predictors before applying Ridge regression.

✅ Prevents misleading coefficient shrinkage.

✅ Ensures numerical stability and better model performance.

**Z-score standardisation** is a great way to scale variables such that they have similar (though not identical) ranges, in a way that is less sensitive to outlier values than min-max scaling.

In [None]:
# Split data into predictors and response
X = df.drop('ZAR/USD', axis=1)
y = df['ZAR/USD']

# Import scaler method from sklearn
from sklearn.preprocessing import StandardScaler

# Create scaler object
scaler = StandardScaler()

# Create scaled version of the predictors (there is no need to scale the response)
X_scaled = scaler.fit_transform(X)

# Check mean and standard deviation after scaling
print("Feature Means (after standardization):", X_scaled.mean(axis=0))  # Should be close to 0
print("Feature Std Dev (after standardization):", X_scaled.std(axis=0))  # Should be close to 1

# Convert the scaled predictor values into a dataframe
X_standardise = pd.DataFrame(X_scaled,columns=X.columns)
X_standardise.head()

# 3.1. Regularisation Methods: LASSO Regression

- Understand the difference between L1 and L2 regularisation
- Understand the concept of sparsity.

### Comparison of Ridge Regression and Lasso Regression in Python
Compare Ridge Regression and Lasso Regression to understand their differences, advantages, and disadvantages.

| Criteria | Ridge Regression  | Lasso Regression  | 
|---------------|--------|----------------------------------|   
|Handling Multicollinearity|Handles multicollinearity well |Tends to keep one variable from a correlated group, chosen somewhat arbitrarily, and drop the others.| 
|Computational Efficiency|Fast and scalable |Can be slow for large feature sets | 
|Effect on Coefficient | Shrinks but never sets to zero | Shrinks and sets some coefficients to zero (feature selection) |
|Penalty Type | $L_2$ norm: $\lambda \sum \beta^2_j$ | $L_1$ norm: $\lambda \sum \lvert \beta_j \rvert$ |
|Interpretability | Harder to interpret since all predictors contribute.  (keeps all variables) | Easier to interpret as some coefficients are set to zero. (performs feature selection)|
|Use Case |Best for reducing variance without eliminating variables|Best when a sparse model (fewer variables) is desired|

### L1 (LASSO) vs. L2(Ridge) Regularization Techniques

The key difference is in how they assign penalties to the coefficients:

Ridge Regression:
- Performs L2 regularization, i.e., adds penalty equivalent to the square of the magnitude of coefficients
    - Minimization objective = LS Obj + α * (sum of square of coefficients)

Lasso Regression:
- Performs L1 regularization, i.e., adds penalty equivalent to the absolute value of the magnitude of coefficients
    - Minimization objective = LS Obj + α * (sum of the absolute value of coefficients)

LS Obj refers to the ‘least squares objective,’ i.e., the linear regression objective without regularization.

#### Key Differences between Ridge and Lasso Regression
Observations:

1️⃣ Ridge Shrinks Coefficients But Keeps All Features
- Ridge regression reduces overfitting while keeping every feature in the model.
    - It reduces the complexity of the model by shrinking the coefficients, whereas Lasso regression reduces overfitting and also performs automatic feature selection.
- All coefficients remain nonzero.
- Good when we believe all variables contribute to the model.

2️⃣ Lasso Shrinks Some Coefficients to Zero (Feature Selection)
- Lasso regression can set coefficients to exactly zero, whereas Ridge regression never does.
    - $X_2$, $X_4$, and $X_5$ are eliminated from the model.
- Lasso automatically selects the most relevant features.

3️⃣ Ridge Has a Lower RMSE than Lasso
- Ridge typically performs better in predictive accuracy when all variables contain useful information.

4️⃣ Lasso is More Interpretable
- Lasso provides a sparse solution, making it easier to interpret.

##### Disadvantages of Ridge Regression
1️⃣ Does Not Perform Feature Selection
- Ridge shrinks coefficients but never forces them to zero, meaning it retains all features, even unimportant ones.
- This makes it harder to interpret the model.

2️⃣ Less Effective for Sparse Solutions
- If many variables have no real effect on the target, Ridge still assigns them small coefficients, leading to unnecessary complexity.

3️⃣ Sensitive to Feature Scaling
- Requires standardization for optimal performance.

### Shrinkage Methods

In ridge regression, we learned that it is possible to modify and potentially improve the test-set performance of a least squares regression model by reducing the magnitude of some subset of the coefficients $\hat{\beta}$.
- The ridge regression process of reducing the magnitude of those coefficients is a type of _shrinkage_ method - we are attempting to shrink the values of those less important coefficients.
- In ridge regression, it is possible to shrink a coefficient's value towards zero, but never reaching exactly zero.

### Sparsity

The L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently large, which means the corresponding features are removed from the model entirely.
- Therefore, the lasso method also performs Feature selection and is said to yield sparse models.
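
Why the L1 penalty produces exact zeros while the L2 penalty does not can be seen in a special case. The sketch below assumes orthonormal predictors ($X^TX = I$) and the objectives RSS + λ·(penalty); under that assumption ridge rescales each least-squares estimate by $1/(1+\lambda)$, while lasso subtracts $\lambda/2$ from its magnitude and clips at zero (soft-thresholding). The function names and input values are hypothetical.

```python
import numpy as np

def ridge_shrink(beta_ols, lam):
    """Ridge solution for orthonormal predictors: proportional shrinkage."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, lam):
    """Lasso solution for orthonormal predictors: soft-thresholding at lam / 2."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam / 2.0, 0.0)

beta_ols = np.array([3.0, -1.5, 0.4, -0.2])   # hypothetical least-squares estimates
lam = 1.0

ridge_coef = ridge_shrink(beta_ols, lam)
lasso_coef = lasso_shrink(beta_ols, lam)
print("ridge:", ridge_coef)
print("lasso:", lasso_coef)
```

Every ridge coefficient stays nonzero, while any least-squares estimate smaller than $\lambda/2$ in magnitude is set to exactly zero by the lasso rule. With correlated predictors there is no such closed form, but the qualitative behaviour is the same.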

##### Advantages of Lasso Over Ridge Regression
✅ 1. Automatic Feature Selection
- Lasso forces some coefficients to zero, making it useful when only a few variables are important.

✅ 2. More Interpretable Models
- Since Lasso removes irrelevant features, the model is easier to explain and analyze.

✅ 3. Works Well When Many Features Are Irrelevant
- If there are many insignificant predictors, Lasso can remove them, leading to a sparser and more efficient model.

##### Limitation of Lasso Regression:

High-Dimensional Data Problem (p > n): 
- If the number of predictors is greater than the number of data points, 
    - Lasso will pick at most n predictors as non-zero, even if all predictors are relevant.

Multicollinearity Problem: 
- If there are two or more highly collinear variables, LASSO regression tends to select one of them more or less arbitrarily, which makes the model harder to interpret.

##### When to Use Ridge vs. Lasso?
📌 Use Ridge When:
- Multicollinearity is high (correlated predictors).
- All features are expected to contribute to the model.
- Predictive accuracy is more important than interpretability.

📌 Use Lasso When:
- Feature selection is needed (only a few predictors are important).
- You want a simple and interpretable model.
- Some variables are irrelevant (Lasso will drop them).

### **LASSO Regression**

Description
- Lasso regression, or Least Absolute Shrinkage and Selection Operator, is a regularization method that also includes a penalty term but can set some coefficients exactly to zero, effectively selecting relevant features.

Penalty Type
- Lasso regression employs an L1 penalty, 
    - which sums the absolute values of the coefficients multiplied by lambda.

Coefficient Impact
- The L1 penalty in lasso regression can drive some coefficients to exactly zero when the lambda value is large enough, performing feature selection and resulting in a sparse model.

Feature Selection
- Lasso regression can set some coefficients to zero, effectively selecting the most relevant features and improving model interpretability.

Use Case
- Lasso regression is preferred when the goal is feature selection, resulting in a simpler and more interpretable model with fewer variables.

Model Complexity
- Lasso regression can lead to a less complex model by setting some coefficients to zero, reducing the number of effective parameters.

Interpretability
- Lasso regression can improve interpretability by selecting only the most relevant features, making the model’s predictions more explainable.

Sparsity
- Lasso regression can produce sparse models by setting some coefficients to exactly zero.

Sensitivity
- Like ridge, lasso keeps the squared-error loss, so it remains sensitive to outliers in the response; the L1 penalty acts on the coefficients and does not change this.

Final Thoughts
- Ridge is better for prediction when all variables have some effect.
- Lasso is better when we suspect that many variables are irrelevant.
- Elastic Net (a hybrid of Ridge and Lasso) can be used when both regularization and feature selection are needed.

In [None]:
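import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error

# X_train, X_test, y_train, y_test come from the earlier train-test split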

# Define Ridge and Lasso models
ridge = Ridge(alpha=1)
lasso = Lasso(alpha=0.1)  # Smaller alpha to avoid excessive shrinkage

# Fit models
ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)

# Predictions
y_pred_ridge = ridge.predict(X_test)
y_pred_lasso = lasso.predict(X_test)

# Compute RMSE
ridge_rmse = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
lasso_rmse = np.sqrt(mean_squared_error(y_test, y_pred_lasso))

# Print coefficients and RMSE
print("Ridge Regression Coefficients:", ridge.coef_)
print("Ridge Regression RMSE:", ridge_rmse)

print("Lasso Regression Coefficients:", lasso.coef_)
print("Lasso Regression RMSE:", lasso_rmse)


Recall the optimisation expression for ridge regression:

$$\min_{\beta} (RSS + \alpha\sum_{j=1}^p\beta_j^2)$$

where we attempt to minimise the RSS plus a penalty term. This can be rewritten:

$$\min_{\beta} (RSS + \alpha(L2\_norm))$$

where $L2\_norm$ is the *sum of the squares of the coefficients* (strictly speaking, the squared L2 norm).

In LASSO regularisation, 
- we replace the $L2\_norm$ with what is known as the $L1\_norm$: the *sum of the _absolute_ values of the coefficients*.

This is a relatively recent adaptation of ridge regression which is capable of shrinking coefficients to exactly zero, effectively removing the corresponding predictors from the model entirely and creating what we call a sparse model (one which uses some subset of all of the available predictors).

LASSO achieves both shrinkage and subset selection.

### Objective function

A LASSO model is fit by minimising the following expression:

$$\sum_{i=1}^n(y_i-(\beta_0+\sum_{j=1}^p\beta_jx_{ij}))^2 + \lambda\sum_{j=1}^p|\beta_j|$$

which can be rewritten as follows:

$$\min_{\beta} (RSS + \lambda\sum_{j=1}^p|\beta_j|)$$

or,

$$\min_{\beta} (RSS + \lambda(L1\_norm))$$

Lasso regression performs L1 regularization, i.e., it adds a factor of the sum of the absolute value of coefficients in the optimization objective.

Objective = RSS + $\lambda$ * (sum of the absolute value of coefficients)
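
The objective can be evaluated directly for any set of coefficients. One detail worth knowing when matching it to code: scikit-learn's `Lasso` minimises RSS / (2n) + alpha · (L1 norm), so its `alpha` corresponds to λ / (2n) in the formula above. The sketch below uses synthetic data and a hypothetical helper `lasso_objective`; it checks that moving a fitted coefficient away from its value does not lower the objective.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

def lasso_objective(X, y, intercept, coef, lam):
    """RSS + lam * (sum of absolute coefficient values)."""
    rss = np.sum((y - (intercept + X @ coef)) ** 2)
    return rss + lam * np.sum(np.abs(coef))

# Synthetic data (an assumption for illustration)
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=0)
n = X.shape[0]

# Translate lambda into scikit-learn's alpha: alpha = lam / (2n)
lam = 50.0
model = Lasso(alpha=lam / (2 * n)).fit(X, y)

fitted = lasso_objective(X, y, model.intercept_, model.coef_, lam)

# Nudge one coefficient away from the fitted value and re-evaluate
nudged_coef = model.coef_.copy()
nudged_coef[0] += 0.5
nudged = lasso_objective(X, y, model.intercept_, nudged_coef, lam)

print("Objective at fitted coefficients:", fitted)
print("Objective after nudging one coefficient:", nudged)
```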

Breakdown of the terms:
- **Residual Sum of Squares (RSS):**
    - In minimising _RSS_, we improve the overall fit of the model. 
$$RSS = \sum^n_{i = 1} (y_i - \hat{y_i})^2 $$
- This represents the sum of squared differences between the observed values $y_i$ and the predicted values $\hat{y_i}$.

- **L1 Regularization Term (Penalty Term):**
    - In Lasso Regression, the L1 norm is used as a penalty term.
    - The L1 norm is the sum of the absolute values of the regression coefficients:
$$L1\_norm = \sum^p_{j = 1} | \beta_j | $$
- where:
    - $\beta_j$ are the regression coefficients.
    - p is the number of features.
- Multiplied by $\lambda$, this term penalizes large coefficient values, preventing overfitting.

In matrix form, the Lasso objective is:
$$ \min_{\beta} \; ||y - X \beta||^2 + \lambda \sum^p_{j = 1} | \beta_j | $$

- **$\lambda$ (`alpha` in the code)** 
    - works as it does in ridge regression, controlling the trade-off between RSS and the magnitude of coefficients. 
    - As in ridge, $\lambda$ can take various values.
        - $\lambda$ = 0: Same coefficients as simple linear regression
        - $\lambda$ = ∞: All coefficients zero (same logic as before)
        - 0 < $\lambda$ < ∞: coefficients between 0 and those of simple linear regression
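
The three regimes can be seen by fitting Lasso over a few penalty values and counting the non-zero coefficients. This is a sketch on synthetic data in which only 3 of 10 features drive the response; the data, the alpha values and `max_iter` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 3 of which are informative (an assumption)
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

alphas = [0.01, 1.0, 10.0, 1e4]
nonzero_counts = []
for alpha in alphas:
    coef = Lasso(alpha=alpha, max_iter=10_000).fit(X, y).coef_
    nonzero_counts.append(int(np.sum(coef != 0)))
    print(f"alpha = {alpha:>8}: {nonzero_counts[-1]} non-zero coefficients")
```

A very small alpha behaves almost like least squares and keeps most coefficients, while a very large alpha drives every coefficient to zero; values in between give sparse models.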

In [None]:
# Separate the features from the response
X = df.drop('ZAR/USD', axis=1)
y = df['ZAR/USD']

In [None]:
# Import the scaling module
from sklearn.preprocessing import StandardScaler

# Create standardization object
scaler = StandardScaler()

# Save standardized features into new variable
X_scaled = scaler.fit_transform(X)

In [None]:
# Import train/test split module
from sklearn.model_selection import train_test_split

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, 
                                                    y, 
                                                    test_size=0.20,
                                                    random_state=1,
                                                    shuffle=False)

##### Calculate L1 Norm

In [None]:
# Import LASSO module and helpers
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Create LASSO model object, setting alpha to 0.01
lasso = Lasso(alpha=0.01)
# or, for a stronger penalty:
# lasso = Lasso(alpha=0.1)

# Train the LASSO model
lasso.fit(X_train, y_train)

# Extract intercept from model
intercept = float(lasso.intercept_)

# Extract coefficients from model
coeff = pd.DataFrame(lasso.coef_, X.columns, columns=['Coefficient'])

# Calculate L1 norm (sum of absolute coefficient values)
l1_norm = np.sum(np.abs(lasso.coef_))

# Print results
print("Intercept:", intercept)
print("Lasso Coefficients:", coeff)
print("L1 Norm of Coefficients:", l1_norm)

##### Interpretation of the intercept and coefficients

We interpret the values of the intercept and coefficients the same way as before:

 - The intercept can be interpreted as the **expected exchange rate when all the features are equal to their means**.
 - Each coefficient is interpreted as the expected change in the response variable given an increase of 1 in the **scaled feature value**.
 
See from the list of coefficients above that some of the coefficients have indeed been shrunk to exactly zero.
- ✅ Some coefficients are zeroed out → Feature selection.

✅ The L1 norm is the sum of the absolute values of nonzero coefficients.

Why is the L1 Norm Important?
- 📌 Used for feature selection (unlike Ridge which only shrinks).
- 📌 Encourages sparsity, keeping only the most relevant predictors.
- 📌 Higher alpha (λ) → More coefficients become zero (stronger penalty).
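
As a small illustration of the two ideas above (the L1 norm and sparsity), the sketch below uses a made-up coefficient vector `coef_demo`; it is not output from the model fitted earlier.

```python
import numpy as np

# Hypothetical fitted LASSO coefficients (illustrative values only)
coef_demo = np.array([0.8, 0.0, -1.5, 0.0, 0.2])

# L1 norm: sum of absolute values of the coefficients
l1_norm_demo = np.sum(np.abs(coef_demo))

# Sparsity: how many features survived, and which ones
n_selected = np.count_nonzero(coef_demo)
selected_idx = np.flatnonzero(coef_demo)

print("L1 norm:", l1_norm_demo)
print("Selected features:", n_selected, "at positions", selected_idx)
```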

##### Assessment of predictive accuracy
Fit the following models as well, so that the LASSO results can be compared thoroughly:

- A least squares model using all available predictors;
- A least squares model using the predictors with non-zero coefficients from LASSO;
- A ridge regression model using all available predictors.
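
The subset for the second model can also be derived from the coefficients themselves rather than typed out by name. The sketch below shows the idea on synthetic data; `X_demo`, `y_demo` and `lasso_coef_demo` are hypothetical stand-ins, and the least squares refit uses `np.linalg.lstsq` purely to keep the example self-contained.

```python
import numpy as np

# Synthetic data: only features 0 and 2 actually drive the response
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=50)

# Hypothetical LASSO coefficients in which features 1 and 3 were shrunk to zero
lasso_coef_demo = np.array([1.9, 0.0, -0.9, 0.0])

# Boolean mask of the predictors LASSO kept
keep = lasso_coef_demo != 0
X_sub = X_demo[:, keep]

# Least squares refit on the retained predictors (with an intercept column)
design = np.column_stack([np.ones(len(X_sub)), X_sub])
beta, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

print("Kept columns:", np.flatnonzero(keep))
print("Refitted [intercept, coefficients]:", np.round(beta, 3))
```

With a pandas DataFrame the same mask can be applied to the column index, which avoids maintaining a hand-written list of dropped columns.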

In [None]:
# Import the least squares and ridge estimators
from sklearn.linear_model import LinearRegression, Ridge

# Drop the response ('ZAR/USD') and the predictors that LASSO shrank to zero
X_subset = df.drop(['ZAR/USD',
                   'Total Reserves excl Gold (USD)',
                   'IMF Reserve Position (USD)',
                   'Claims on Non-residents (USD)',
                   'Central Bank Policy Rate',
                   'Treasury Bill Rate',
                   'Savings Rate',
                   'Deposit Rate',
                   'Lending Rate',
                   'Government Bonds'], axis=1)

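# Standardise the subset of predictors
X_subset_scaled = scaler.fit_transform(X_subset)

# Split the standardised subset (not the raw X_subset) into train and test sets
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_subset_scaled, 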
                                                        y, 
                                                        test_size=0.20, 
                                                        random_state=1,
                                                        shuffle=False)

# Least squares using non-zero variables from LASSO
lm_subset = LinearRegression()

# Least squares using all predictors
lm_all = LinearRegression()

# Ridge using all predictors
ridge = Ridge()

lm_subset.fit(X_train2, y_train2)
lm_all.fit(X_train, y_train)
ridge.fit(X_train, y_train)

In [None]:
from sklearn import metrics

# Make training set predictions for each model
train_lm_subset = lm_subset.predict(X_train2)
train_lm_all = lm_all.predict(X_train)
train_ridge = ridge.predict(X_train)
train_lasso = lasso.predict(X_train)

# Make test set predictions for each model
test_lm_subset = lm_subset.predict(X_test2)
test_lm_all = lm_all.predict(X_test)
test_ridge = ridge.predict(X_test)
test_lasso = lasso.predict(X_test)

In [None]:
# Dictionary of results
results_dict = {'Training MSE':
                    {
                        "Least Squares, Subset": metrics.mean_squared_error(y_train2, train_lm_subset),
                        "Least Squares, All": metrics.mean_squared_error(y_train, train_lm_all),
                        "Ridge": metrics.mean_squared_error(y_train, train_ridge),
                        "LASSO": metrics.mean_squared_error(y_train, train_lasso)
                    },
                    'Test MSE':
                    {
                        "Least Squares, Subset": metrics.mean_squared_error(y_test2, test_lm_subset),
                        "Least Squares, All": metrics.mean_squared_error(y_test, test_lm_all),
                        "Ridge": metrics.mean_squared_error(y_test, test_ridge),
                        "LASSO": metrics.mean_squared_error(y_test, test_lasso)
                    }
                }

In [None]:
# Create dataframe from dictionary
results_df = pd.DataFrame(data=results_dict)

# View the results
results_df

##### Result interpretation

LASSO was able to perform subset selection, while also performing shrinkage. 
- The result is a more generalised model with greater predictive capacity. 

The least squares model trained on the same subset of variables that LASSO retained as non-zero scored a higher MSE on the test set, indicating that the shrinkage LASSO applied to those remaining variables was effective.

LASSO achieved the best MSE on the test set, followed by ridge regression.

##### Plot our results to end off.
Plot the test set against the three primary methods explored here:

- Least squares using all predictors;
- Ridge using all predictors;
- LASSO using all predictors.

In [None]:
# We want to plot the train and test response variables as a continuous line
# (Series.append was removed in pandas 2.0, so pd.concat is used instead)
train_plot = pd.concat([y_train, pd.Series(y_test.iloc[0], index=['2016M01'])])

plt.plot(np.arange(96,120), lasso.predict(X_test), label='LASSO')
plt.plot(np.arange(96,120), ridge.predict(X_test), label='Ridge')
plt.plot(np.arange(96,120), lm_all.predict(X_test), label='Least Squares')
plt.plot(np.arange(96,120), y_test, label='Testing')
plt.legend()

plt.show()

##### **Effect of the Tuning Parameter $\lambda$ on Coefficients in Lasso Regression**
In Lasso Regression, the tuning parameter λ (alpha in Python) controls the L1 penalty, which affects the size of the regression coefficients and determines feature selection.
$$\min_{\beta} ||y - X\beta ||^2 + \lambda \sum^P_{j = 1} |\beta_j| $$
- where:
    - $||y - X\beta ||^2 $:  least squares error.
    - $\sum^P_{j = 1} |\beta_j|$: is the L1 penalty that forces some coefficients to shrink to zero.

Impact of λ on Coefficients

Small λ (Near Zero)
- Very little penalty → Lasso behaves like ordinary least squares (OLS).
- Most coefficients remain nonzero.
- Model may overfit the data.

Moderate λ
- Some coefficients shrink to exactly zero, while the rest are reduced in magnitude but stay nonzero.
- Feature selection occurs → Less important variables are eliminated.
- Model balances bias-variance tradeoff.

Large λ
- Strong penalty on large coefficients.
- Many coefficients shrink exactly to zero → Only the most important features remain.
- Model has higher bias but less variance.
- If λ is too large, it removes too many features and model underfits.
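
The three regimes above can be reproduced without any library fitting routine. The sketch below is a minimal cyclic coordinate descent for the Lasso, written under the assumption that the objective is scaled as $\frac{1}{2n}||y - X\beta||^2 + \alpha \sum |\beta_j|$ (the scaling scikit-learn uses, which differs from the formula above only by a constant factor on λ). The data, the function names and the alpha grid are all illustrative.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; anything within [-gamma, gamma] becomes 0."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1, one coefficient at a time."""
    n, p = X.shape
    w = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back in
            partial_residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_residual / n
            w[j] = soft_threshold(rho, alpha) / col_scale[j]
    return w

# Synthetic, standardised data with two irrelevant features (an assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.5])
y_demo = X_demo @ true_w + rng.normal(scale=0.5, size=200)
y_demo = y_demo - y_demo.mean()

alphas_demo = [0.001, 0.1, 1.0, 10.0]
paths = {a: lasso_coordinate_descent(X_demo, y_demo, a) for a in alphas_demo}
for a, w in paths.items():
    print(f"alpha={a:<6} nonzero={np.count_nonzero(w)} coefs={np.round(w, 3)}")
```

At the smallest alpha the estimates are essentially the least squares solution, and at the largest alpha every coefficient is exactly zero, which is the underfitting case described above.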

Expected Output

A coefficient path plot that shows: 

✅ Small λ → Most coefficients stay nonzero and close to their OLS values.

✅ Medium λ → Some coefficients shrink, some go to zero.

✅ Large λ → Many coefficients disappear, leaving only a few nonzero values.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

# Example: Train Lasso with different alpha values
alphas = [0.01, 0.1, 1, 10, 100]  # Different lambda values
lasso_coeffs = []

for alpha in alphas:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, y_train)
    lasso_coeffs.append(lasso.coef_)

# Convert to NumPy array for plotting
lasso_coeffs = np.array(lasso_coeffs)

# Plot coefficient paths
plt.figure(figsize=(8, 5))
for i in range(X_train.shape[1]):
    plt.plot(alphas, lasso_coeffs[:, i], marker='o', label=f'Feature {i+1}')

plt.xscale("log")  # Log scale for better visualization
plt.xlabel("Lambda (α)")
plt.ylabel("Coefficient Values")
plt.title("Effect of Tuning Parameter (λ) on Lasso Coefficients")
plt.legend()
plt.show()


##### **Alternative Formulation of the Objective Functions for Lasso and Ridge Regression**
Both Lasso and Ridge Regression aim to regularize linear regression models by adding a constraint on the size of the regression coefficients.

Instead of writing their objective functions in a penalized form, they can also be expressed in a constraint-based form.

**Ridge Regression (L2 Regularization)**

Penalized Formulation
$$\min_{\beta} ||y - X\beta ||^2 + \lambda \sum^P_{j = 1} \beta_j^2 $$
- where:
    - $||y - X\beta ||^2 $:  least squares error.
    - $\sum^P_{j = 1} \beta_j^2$: is the L2 penalty that shrinks coefficients toward zero without setting them exactly to zero.

Alternative Constraint-Based Formulation
$$\min_{\beta} ||y - X\beta ||^2 \quad \text{subject to} \quad \sum^P_{j = 1} \beta_j^2 \leq t$$
- where:
    - Instead of adding a penalty, we constrain the sum of squared coefficients to be within a threshold t.
    - Smaller t leads to stronger regularization, forcing more shrinkage of coefficients.

✅ Interpretation: The Ridge solution lies inside a p-dimensional hypersphere with radius $\sqrt{t}$.

**Lasso Regression (L1 Regularization)**

Penalized Formulation
$$\min_{\beta} ||y - X\beta ||^2 + \lambda \sum^P_{j = 1} |\beta_j| $$
- where:
    - $||y - X\beta ||^2 $:  sum of squared residuals.
    - $\sum^P_{j = 1} |\beta_j|$: is the L1 penalty which induces sparsity.

Alternative Constraint-Based Formulation
$$\min_{\beta} ||y - X\beta ||^2 \quad \text{subject to} \quad \sum^P_{j = 1} |\beta_j| \leq t$$
- where:
    - The sum of absolute values of coefficients is constrained within t.
    - Stronger regularization (small t) forces many coefficients to exactly zero.

✅ Interpretation: The Lasso solution lies inside a p-dimensional diamond-shaped (L1 norm) constraint region.

Key Differences in Constraint Regions
- Ridge: Constraint region is a hypersphere → smooth shrinkage of coefficients.
- Lasso: Constraint region is a diamond → Encourages zero coefficients (feature selection).

Conclusion
- Both Lasso and Ridge can be formulated either in penalized form or constraint-based form.
- Lasso's constraint is diamond-shaped, making it more likely to force coefficients to zero.
- Ridge's constraint is spherical, meaning it only shrinks coefficients but does not remove them.
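
The link between the two forms can be checked numerically for Ridge, which has a closed-form solution. For each λ, the penalized solution also solves the constrained problem with budget $t = \sum_j \hat\beta_j(\lambda)^2$, so a larger λ corresponds to a smaller t. The sketch below assumes synthetic data and no intercept; `ridge_closed_form` is a helper written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

def ridge_closed_form(X, y, lam):
    """Solve min ||y - Xb||^2 + lam * sum(b_j^2) (no intercept) in closed form."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lambdas_demo = [0.0, 1.0, 10.0, 100.0, 1000.0]
implied_budgets = []
for lam in lambdas_demo:
    beta = ridge_closed_form(X_demo, y_demo, lam)
    t = float(np.sum(beta ** 2))  # budget that this lambda implicitly enforces
    implied_budgets.append(t)
    print(f"lambda={lam:>7}: implied budget t = {t:.4f}")
```

The printed budgets shrink as λ grows, which is the one-to-one (inverse) correspondence between the penalty λ and the threshold t.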

In [None]:
# Step 1: Import Libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

# Step 2: Generate Synthetic Data
# Generate synthetic dataset
np.random.seed(42)
X, y = make_regression(n_samples=100, n_features=10, noise=10)

# Standardize the features (important for Ridge/Lasso)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Train Ridge and Lasso for Different λ Values
# Define different lambda values
lambdas = [0.01, 0.1, 1, 10, 100]

ridge_coefs = []
lasso_coefs = []

for alpha in lambdas:
    # Train Ridge Regression
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_scaled, y)
    ridge_coefs.append(ridge.coef_)
    
    # Train Lasso Regression
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_scaled, y)
    lasso_coefs.append(lasso.coef_)

# Step 4: Visualize the Effect of Regularization
ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)

plt.figure(figsize=(12, 6))

# Plot Ridge Coefficients
plt.subplot(1, 2, 1)
for i in range(X.shape[1]):
    plt.plot(lambdas, ridge_coefs[:, i], marker='o', label=f'Feature {i+1}')
plt.xscale("log")
plt.xlabel("Lambda (α)")
plt.ylabel("Coefficient Value")
plt.title("Ridge Regression: Effect of Regularization")
plt.legend()

# Plot Lasso Coefficients
plt.subplot(1, 2, 2)
for i in range(X.shape[1]):
    plt.plot(lambdas, lasso_coefs[:, i], marker='o', label=f'Feature {i+1}')
plt.xscale("log")
plt.xlabel("Lambda (α)")
plt.ylabel("Coefficient Value")
plt.title("Lasso Regression: Effect of Regularization")
plt.legend()

plt.show()

Expected Output

Ridge Regression Plot:

✅ Coefficients shrink smoothly as λ increases.

✅ No coefficients are set to zero → Ridge does not perform feature selection.

Lasso Regression Plot:

✅ Coefficients shrink and some become exactly zero for larger λ.

✅ Feature selection occurs → Some coefficients are eliminated completely.

_______________

##### **Graphical Interpretation of Lasso and Ridge Regression (Two Features Case)**
When dealing with two features ($𝑋_1, 𝑋_2$), we can visualize how Lasso (L1) and Ridge (L2) regularization constrain the coefficient estimates in a 2D space.

Understanding the Constraint Regions

Both Ridge and Lasso constrain the overall size of the regression coefficients, but they measure that size differently:

| Method | Constraint Shape | Mathematical Constraint | Effect on Coefficients | When to Use? |
|--------|------------------|-------------------------|------------------------|--------------|
| Ridge Regression (L2) | Circular region (L2 ball) | $\beta_1^2 + \beta_2^2 \leq t$ | Shrinks coefficients smoothly, but none become exactly zero | If all predictors are important |
| Lasso Regression (L1) | Diamond-shaped region (L1 ball) | $\lvert\beta_1\rvert + \lvert\beta_2\rvert \leq t$ | Performs feature selection by setting some coefficients exactly to zero | If some predictors are irrelevant |

Graphical Explanation

**(A) Ridge Regression (L2)**
- The constraint region is a circle (a sphere or hypersphere in higher dimensions).
- The OLS solution ($\beta_1, \beta_2$) is found without constraints; Ridge forces the solution to lie inside or on the circle.
- The contours of the Least Squares Loss function are ellipses, and the Ridge estimate is the point where the smallest such ellipse touches the circle. Because the circle has no corners, this point rarely lies on an axis, so the coefficients shrink but are not set exactly to zero.

📌 Key Observation: Ridge does not perform feature selection but shrinks coefficients smoothly.

**(B) Lasso Regression (L1)**
- The constraint region is a diamond (formed by the equation $∣𝛽_1∣ + ∣𝛽_2∣ \leq 𝑡$).
- The Least Squares Loss function has elliptical contours, and the solution is found where these ellipses first touch the diamond.
- Because the diamond has sharp corners, the solution often lands on an axis, meaning one of the coefficients is exactly zero.

📌 Key Observation: Lasso performs feature selection by setting some coefficients exactly to zero.

Visual Representation

Imagine the following plot where:
- ✅ The shaded region represents the constraint area (a blue circle for Ridge, a red diamond for Lasso).
- ✅ The ellipses represent the contours of the Least Squares Loss function.
- ✅ The black dot represents the OLS solution (unconstrained).
- ✅ The point where the smallest ellipse touches the constraint region represents the Ridge/Lasso solution.

In Ridge Regression (left):
- The solution lies on the edge of the circle, usually away from the axes, meaning coefficients shrink but are not exactly zero.
- The blue circular constraint shows how Ridge shrinks coefficients while keeping them nonzero.

In Lasso Regression (right):
- The solution often lands on an axis, meaning one coefficient becomes exactly zero.
- The red diamond-shaped constraint leads to sparsity, setting some coefficients exactly to zero.
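
The picture can also be checked numerically. The sketch below searches the boundary of each constraint region for the point with the smallest loss. It assumes a made-up OLS solution `beta_ols`, a made-up curvature matrix `A` that gives elliptical contours, and a budget `t = 1`; since `beta_ols` lies outside both regions, the constrained solution sits on the boundary.

```python
import numpy as np

beta_ols = np.array([2.0, 0.5])   # hypothetical unconstrained (OLS) solution
A = np.diag([1.0, 2.0])           # hypothetical curvature, gives elliptical contours
t = 1.0                           # budget for both constraints

def loss(beta):
    """Quadratic loss whose contours are ellipses centred on beta_ols."""
    d = beta - beta_ols
    return d @ A @ d

# Directions around the origin in half-degree steps
theta = np.linspace(0.0, 2.0 * np.pi, 720, endpoint=False)
directions = np.column_stack([np.cos(theta), np.sin(theta)])

# Boundary of the circle: beta1^2 + beta2^2 = t
ridge_boundary = np.sqrt(t) * directions
# Boundary of the diamond: |beta1| + |beta2| = t
lasso_boundary = t * directions / np.abs(directions).sum(axis=1, keepdims=True)

ridge_sol = min(ridge_boundary, key=loss)
lasso_sol = min(lasso_boundary, key=loss)

print("Ridge solution:", np.round(ridge_sol, 3))
print("Lasso solution:", np.round(lasso_sol, 3))
```

In this setup the Lasso solution lands on a corner of the diamond, so its second coefficient is exactly zero, while the Ridge solution keeps both coefficients nonzero.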

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Create a grid of beta1 and beta2 values
beta1 = np.linspace(-1, 1, 400)
beta2 = np.linspace(-1, 1, 400)
B1, B2 = np.meshgrid(beta1, beta2)

# Compute the constraint regions
ridge_constraint = B1**2 + B2**2  # L2 norm (circle)
lasso_constraint = np.abs(B1) + np.abs(B2)  # L1 norm (diamond)

# Set levels for contour plot
ridge_level = [0.5]  # Circle boundary (Ridge)
lasso_level = [0.5]  # Diamond boundary (Lasso)

# Plot Ridge Regression Constraint (L2 norm)
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.contour(B1, B2, ridge_constraint, levels=ridge_level, colors='blue', linewidths=2)
plt.axhline(0, color='gray', linestyle='--', linewidth=1)
plt.axvline(0, color='gray', linestyle='--', linewidth=1)
plt.title("Ridge Regression Constraint (L2 Norm)")
plt.xlabel(r"$\beta_1$")
plt.ylabel(r"$\beta_2$")

# Plot Lasso Regression Constraint (L1 norm)
plt.subplot(1, 2, 2)
plt.contour(B1, B2, lasso_constraint, levels=lasso_level, colors='red', linewidths=2)
plt.axhline(0, color='gray', linestyle='--', linewidth=1)
plt.axvline(0, color='gray', linestyle='--', linewidth=1)
plt.title("Lasso Regression Constraint (L1 Norm)")
plt.xlabel(r"$\beta_1$")
plt.ylabel(r"$\beta_2$")

plt.tight_layout()
plt.show()


##### **Geometric Shape of Constraints for Lasso and Ridge Regression in Two or More Dimensions**

When we generalize Lasso (L1) and Ridge (L2) regression constraints to two or more dimensions, their geometric shapes define how they penalize the regression coefficients.

**Ridge Regression (L2) Constraint: A Hypersphere (Ellipsoid in Some Cases)**

Mathematical Constraint:
$$\sum^P_{j = 1} \beta_j^2 \leq t$$
- In 2D ( $𝛽_1, 𝛽_2$), this equation represents a circle.
- In 3D ($𝛽_1, 𝛽_2, 𝛽_3$), it forms a sphere.
- In higher dimensions, it generalizes to a hypersphere or an ellipsoid if the predictors are scaled differently.

📌 Effect:
- The constraint shrinks the regression coefficients toward zero, but they rarely become exactly zero.
- The solution is found at the intersection of the least-squares loss contours (ellipses) and the Ridge constraint (circle or hypersphere).

**Lasso Regression (L1) Constraint: A Cross-Polytope (Diamond in 2D, Octahedron in 3D)**

Mathematical Constraint:
$$\sum^P_{j = 1} |\beta_j| \leq t$$
- In 2D ( $𝛽_1, 𝛽_2$), this equation represents a diamond (rhombus).
- In 3D ($𝛽_1, 𝛽_2, 𝛽_3$), it forms an octahedron.
- In higher dimensions, it generalizes to a cross-polytope (generalized diamond shape).

📌 Effect:
- The constraint has sharp edges and corners, where the least-squares loss contours are likely to touch.
- This property forces many coefficients to become exactly zero, leading to feature selection.

Visual Comparison of Shapes in Different Dimensions

| Dimensions | Ridge Regression | Lasso Regression |
|------------|------------------|------------------|
| 2D | Circle | Diamond (Rhombus) |
| 3D | Sphere | Octahedron |
| 4D+ | Hypersphere (Ellipsoid) | Cross-Polytope |

Intuition
- Ridge Regression (L2) constraint is a smooth shape (circle, sphere, hypersphere), leading to shrinkage but not sparsity (coefficients are small but nonzero).
- Lasso Regression (L1) constraint has sharp edges and corners, making it easier for the optimal solution to land on an axis, leading to sparse solutions (some coefficients exactly zero).
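
The corners of the L1 region are easy to enumerate: in p dimensions they are the 2p points $\pm t\,e_j$, each with exactly one nonzero coordinate. The sketch below builds them for a few values of p; `cross_polytope_vertices` is a helper written for this illustration.

```python
import numpy as np

def cross_polytope_vertices(p, t):
    """Vertices of the L1 ball {beta : sum(|beta_j|) <= t} in p dimensions."""
    eye = np.eye(p)
    return t * np.vstack([eye, -eye])

for p in (2, 3, 10):
    vertices = cross_polytope_vertices(p, t=0.5)
    nonzeros_per_vertex = np.count_nonzero(vertices, axis=1)
    print(f"p={p}: {len(vertices)} vertices, "
          f"nonzero coordinates per vertex = {nonzeros_per_vertex.max()}")
```

Every corner is a maximally sparse coefficient vector, whereas the sphere has no corners at all, which is the geometric reason the Lasso solution tends to be sparse and the Ridge solution does not.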

Summary of the Constraint Shapes
- In two or more dimensions, the constraints imposed by Ridge and Lasso regression have distinct geometric shapes:
    - Ridge Regression (L2 Norm) - Spherical Constraint:
        - Ridge regression adds an L2 penalty, which constrains the sum of squared coefficients:
        $$\sum^P_{j = 1} \beta_j^2 \leq t$$
        - In 2D ( $𝛽_1, 𝛽_2$), this forms a circular constraint (a Euclidean ball).
        - In higher dimensions, this generalizes to a hypersphere.
        - The coefficients shrink toward zero, but they rarely become exactly zero.
    - Lasso Regression (L1 Norm) - Diamond (Octahedral) Constraint:
        - Lasso regression adds an L1 penalty, constraining the sum of absolute values of the coefficients:
        $$\sum^P_{j = 1} |\beta_j| \leq t$$
        - In two dimensions, this forms a diamond-shaped constraint (or rhombus).
        - In three dimensions, it takes the shape of an octahedron.
        - Because the L1 norm creates sharp corners, solutions tend to lie exactly on the axes, forcing some coefficients to be exactly zero (feature selection).

Geometric Interpretation of Solution Paths
- The optimal regression solution is found at the point where the constraint region intersects with the contours of the least squares loss function (which are ellipses in the case of ordinary least squares).
- Ridge regression solutions tend to shrink coefficients but keep them nonzero since the constraint is smooth and rounded.
- Lasso regression solutions often set some coefficients exactly to zero because the diamond-shaped constraint encourages sparse solutions.

##### This geometric difference explains why Lasso can perform feature selection while Ridge cannot

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.art3d import Poly3DCollection

# Define the constraint level (budget t) for visualization
constraint_level = 0.5

# Ridge (L2 norm) constraint: sphere β1^2 + β2^2 + β3^2 = t, i.e. radius sqrt(t).
# contourf only accepts 2D arrays (passing a 3D grid raises
# "TypeError: Input z must be 2D, not 3D"), so the surface is built
# from spherical coordinates instead.
radius = np.sqrt(constraint_level)
u, v = np.meshgrid(np.linspace(0, 2 * np.pi, 60), np.linspace(0, np.pi, 30))
S1 = radius * np.cos(u) * np.sin(v)
S2 = radius * np.sin(u) * np.sin(v)
S3 = radius * np.cos(v)

# Lasso (L1 norm) constraint: octahedron |β1| + |β2| + |β3| = t.
# Each of its 8 triangular faces joins one vertex on each coordinate axis.
t = constraint_level
faces = [[(s1 * t, 0, 0), (0, s2 * t, 0), (0, 0, s3 * t)]
         for s1 in (-1, 1) for s2 in (-1, 1) for s3 in (-1, 1)]

# Create figure
fig = plt.figure(figsize=(12, 6))

# Plot Ridge Regression Constraint (L2 Norm - Sphere)
ax1 = fig.add_subplot(121, projection='3d')
ax1.plot_surface(S1, S2, S3, color='blue', alpha=0.5)
ax1.set_title("Ridge Regression Constraint (L2 Norm - Sphere)")

# Plot Lasso Regression Constraint (L1 Norm - Octahedron)
ax2 = fig.add_subplot(122, projection='3d')
ax2.add_collection3d(Poly3DCollection(faces, facecolors='red', edgecolors='black', alpha=0.5))
ax2.set_title("Lasso Regression Constraint (L1 Norm - Octahedron)")

# Use the same labels and axis limits on both panels so the shapes are comparable
for ax in (ax1, ax2):
    ax.set_xlabel(r"$\beta_1$")
    ax.set_ylabel(r"$\beta_2$")
    ax.set_zlabel(r"$\beta_3$")
    ax.set_xlim(-1, 1)
    ax.set_ylim(-1, 1)
    ax.set_zlim(-1, 1)

plt.tight_layout()
plt.show()


##### **Impact of Budget Size on Estimating Lasso and Ridge Regression**
The budget size in Ridge and Lasso regression refers to the constraint placed on the sum of squared or absolute values of the coefficients. This budget determines how much shrinkage (regularization) is applied to the regression coefficients.

**Budget in Ridge Regression**

Mathematical Formulation
- Ridge regression minimizes the sum of squared residuals subject to a constraint on the sum of squared coefficients:
$$\sum^P_{j = 1} \beta_j^2 \leq t$$
- where t is the budget that limits the size of the coefficients.

Effect of Budget Size ( t )
- Large Budget (Weak Regularization):
    - Ridge behaves similarly to Ordinary Least Squares (OLS).
    - Coefficients remain large, and overfitting risk increases.
- Small Budget (Strong Regularization):
    - Coefficients shrink towards zero but never become exactly zero.
    - Model has less variance but may introduce bias.

✅ Key Insight: In Ridge, a small budget forces all coefficients to be small, but none of them become exactly zero.

**Budget in Lasso Regression**

Mathematical Formulation
- Lasso regression minimizes the sum of squared residuals subject to a constraint on the sum of absolute values of coefficients:
$$\sum^P_{j = 1} |\beta_j| \leq t$$
- where t is the budget that limits the sum of absolute values of the coefficients.

Effect of Budget Size ( t )
- Large Budget (Weak Regularization):
    - Lasso behaves similarly to OLS.
    - Many coefficients remain large and nonzero.
- Small Budget (Strong Regularization):
    - Many coefficients shrink to exactly zero → Feature selection happens.
    - The model selects only the most relevant predictors, reducing dimensionality.
    
✅ Key Insight: In Lasso, a small budget forces many coefficients to be exactly zero, performing automatic feature selection.

| Aspect | Ridge Regression (L2) | Lasso Regression (L1) |
|--------|-----------------------|-----------------------|
| Effect of Large Budget (Small λ) | Similar to OLS, coefficients remain large | Similar to OLS, all coefficients remain nonzero |
| Effect of Small Budget (Large λ) | Coefficients shrink but never reach zero | Many coefficients shrink to exactly zero (feature selection) |
| Best for | When all predictors are relevant | When some predictors are irrelevant |

 Conclusion
- Ridge regression is better when all variables matter, as it shrinks but does not eliminate coefficients.
- Lasso regression is better when we suspect some variables are unimportant, as it can set coefficients to exactly zero, performing feature selection.
- The budget size (t) controls how much shrinkage is applied, and it moves inversely to the penalty λ:
    - Small λ (Large Budget) → Model behaves like OLS (no regularization).
    - Large λ (Small Budget) → Ridge keeps all coefficients small, Lasso sets some to zero.
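
To make the budget concrete, the sketch below checks whether one coefficient vector fits inside each constraint region for two budgets. The vector `beta_demo` and the helper names are hypothetical and chosen only for illustration.

```python
import numpy as np

def within_ridge_budget(beta, t):
    """True if the sum of squared coefficients is at most t."""
    return float(np.sum(np.square(beta))) <= t

def within_lasso_budget(beta, t):
    """True if the sum of absolute coefficients is at most t."""
    return float(np.sum(np.abs(beta))) <= t

# Hypothetical coefficient vector: sum of squares is about 0.45, sum of absolute values about 0.9
beta_demo = np.array([0.6, 0.0, -0.3])

for t in (0.5, 1.0):
    print(f"t={t}: ridge feasible={within_ridge_budget(beta_demo, t)}, "
          f"lasso feasible={within_lasso_budget(beta_demo, t)}")
```

The same vector is allowed by the Ridge budget of 0.5 but not by the Lasso budget of 0.5, which shows that a given value of t means different things under the two constraints.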

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

# Generate synthetic dataset
np.random.seed(42)
X, y = make_regression(n_samples=100, n_features=10, noise=10)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Define penalty strengths (alpha); a larger alpha corresponds to a smaller budget t
budgets = [0.01, 0.1, 1, 10, 100]

ridge_coefs = []
lasso_coefs = []

for alpha in budgets:
    # Train Ridge and Lasso with different budgets (alpha)
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_scaled, y)
    ridge_coefs.append(ridge.coef_)
    
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_scaled, y)
    lasso_coefs.append(lasso.coef_)

# Convert to numpy arrays for plotting
ridge_coefs = np.array(ridge_coefs)
lasso_coefs = np.array(lasso_coefs)

plt.figure(figsize=(12, 6))

# Plot Ridge Coefficients
plt.subplot(1, 2, 1)
for i in range(X.shape[1]):
    plt.plot(budgets, ridge_coefs[:, i], marker='o', label=f'Feature {i+1}')
plt.xscale("log")
plt.xlabel("Alpha (larger alpha = smaller budget)")
plt.ylabel("Coefficient Value")
plt.title("Effect of Budget on Ridge Regression")
plt.legend()

# Plot Lasso Coefficients
plt.subplot(1, 2, 2)
for i in range(X.shape[1]):
    plt.plot(budgets, lasso_coefs[:, i], marker='o', label=f'Feature {i+1}')
plt.xscale("log")
plt.xlabel("Alpha (larger alpha = smaller budget)")
plt.ylabel("Coefficient Value")
plt.title("Effect of Budget on Lasso Regression")
plt.legend()

plt.show()


#### Lasso regression on the sine / polynomial problem, as below under GLMs

LASSO stands for Least Absolute Shrinkage and Selection Operator.

2 keywords here – 
- absolute and
- selection.

In [None]:
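from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def lasso_regression(data, predictors, alpha, models_to_plot={}):
    # Fit the model. The `normalize` argument was removed from Lasso in
    # scikit-learn 1.2, so the predictors are standardised in a pipeline instead.
    # Multiplying alpha by sqrt(n_samples) keeps the penalty on the same scale
    # as the former normalize=True. max_iter must be an integer.
    scaled_alpha = alpha * np.sqrt(len(data))
    lassoreg = make_pipeline(StandardScaler(), Lasso(alpha=scaled_alpha, max_iter=100000))
    lassoreg.fit(data[predictors],data['y'])
    y_pred = lassoreg.predict(data[predictors])
    
    #Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for alpha: %.3g'%alpha)
    
    #Return the result in pre-defined format
    #(intercept and coefficients are reported on the standardised feature scale)
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    ret.extend([lassoreg[-1].intercept_])
    ret.extend(lassoreg[-1].coef_)
    return ret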

Additional parameters defined in the Lasso function – 
- max_iter.
    - This is the maximum number of iterations for which we want the model to run if it doesn’t converge before. 
    - This exists for Ridge as well, but setting this to a higher than default value was required in this case.

In [None]:
#Initialize predictors to all 15 powers of x
predictors=['x']
predictors.extend(['x_%d'%i for i in range(2,16)])

#Define the alpha values to test
alpha_lasso = [1e-15, 1e-10, 1e-8, 1e-5,1e-4, 1e-3,1e-2, 1, 5, 10]

#Initialize the dataframe to store coefficients
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_lasso[i] for i in range(0,10)]
coef_matrix_lasso = pd.DataFrame(index=ind, columns=col)

#Define the models to plot
models_to_plot = {1e-10:231, 1e-5:232,1e-4:233, 1e-3:234, 1e-2:235, 1:236}

#Iterate over the 10 alpha values:
for i in range(10):
    coef_matrix_lasso.iloc[i,] = lasso_regression(data, predictors, alpha_lasso[i], models_to_plot)

Observations:

Model complexity decreases with an increase in the values of alpha. But notice the straight line at alpha=1.

Expected inference: 
- RSS is higher for higher alphas.
- For the same values of alpha, the coefficients of lasso regression are much smaller than those of ridge regression (compare row 1 of the 2 tables).
- For the same alpha, lasso has a higher RSS (poorer fit) than ridge regression.
- Many of the coefficients are zero, even for very small values of alpha.

Check the number of coefficients that are zero in each model.

In [None]:
coef_matrix_lasso.apply(lambda x: sum(x.values==0),axis=1)

Observations:

- Even for small values of alpha, a significant number of coefficients are zero. 
- This also explains the horizontal line fit for alpha=1 in the lasso plots; it’s just a baseline model! 

This phenomenon, where most coefficients become zero, is called **sparsity**. 
- Although lasso performs feature selection, we achieve this level of sparsity only in special cases.

#### Mathematics behind why coefficients are zero in the case of lasso but not ridge.
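
A sketch of the standard argument, under the simplifying assumption (not stated above) that the features are orthonormal, $X^TX = I$. In that case $||y - X\beta||^2$ equals a constant plus $\sum_j (\hat\beta_j - \beta_j)^2$, where $\hat\beta_j$ is the least squares estimate, so each coefficient can be solved for separately.

For ridge, each coefficient minimises
$$(\hat\beta_j - \beta_j)^2 + \lambda \beta_j^2 \quad \Rightarrow \quad \hat\beta_j^{ridge} = \frac{\hat\beta_j}{1 + \lambda}$$

For lasso, each coefficient minimises
$$(\hat\beta_j - \beta_j)^2 + \lambda |\beta_j| \quad \Rightarrow \quad \hat\beta_j^{lasso} = \text{sign}(\hat\beta_j) \max\left(|\hat\beta_j| - \frac{\lambda}{2}, 0\right)$$

- Ridge divides every estimate by the same factor $1 + \lambda$, so a nonzero $\hat\beta_j$ stays nonzero for any finite λ.
- Lasso subtracts a fixed amount $\lambda / 2$ from the magnitude of every estimate (soft-thresholding), so any coefficient with $|\hat\beta_j| \leq \lambda / 2$ becomes exactly zero.

With correlated features the closed forms no longer hold, but the same qualitative behaviour carries over.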

In [None]:
'''
LINEAR, RIDGE AND LASSO REGRESSION
'''
# importing required libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, Ridge

# read test and train file
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print('\n\n---------DATA---------------\n\n')
print(train.head())

#splitting into training and test
## try building model with the different features and compare the result.
X = train.loc[:,['Outlet_Establishment_Year','Item_MRP']]
x_train, x_cv, y_train, y_cv = train_test_split(X,train.Item_Outlet_Sales,random_state=5)

print('--------Training Linear Regression Model---------------')
lreg = LinearRegression()
#training the model
lreg.fit(x_train,y_train)

#predicting on cv
pred = lreg.predict(x_cv)

#calculating mse
mse = np.mean((pred - y_cv)**2)
print('\nMean Squared Error = ',mse )

#Let us take a look at the coefficients of this linear regression model.
# calculating coefficients
coeff = DataFrame(x_train.columns)

coeff['Coefficient Estimate'] = Series(lreg.coef_)

print(coeff)

print('\n\nModel performance on Test data = ')
print(lreg.score(x_cv,y_cv))

print('\n\n---------Training Ridge Regression Model----------------')

ridge = Ridge()
ridge.fit(x_train,y_train)
pred1 = ridge.predict(x_cv)
mse_1 = np.mean((pred1-y_cv)**2)

print('\n\nMean Squared Error = ',mse_1)

# calculating coefficients
coeff = DataFrame(x_train.columns)
coeff['Coefficient Estimate'] = Series(ridge.coef_)
print(coeff)

print('\n\nModel performance on Test data = ')
print(ridge.score(x_cv,y_cv))


print('\n\n---------Training Lasso Regression Model----------------')

lasso = Lasso()
lasso.fit(x_train,y_train)
pred2 = lasso.predict(x_cv)
mse_2 = np.mean((pred2-y_cv)**2)

print('\n\nMean Squared Error = ',mse_2)

# calculating coefficients
coeff = DataFrame(x_train.columns)
coeff['Coefficient Estimate'] = Series(lasso.coef_)
print(coeff)

print('\n\nModel performance on Test data = ')
print(lasso.score(x_cv,y_cv))

### Comparison of Ridge, Lasso, and Elastic Net Regression in Python
Now, let's compare Ridge Regression, Lasso Regression, and Elastic Net Regression to see how they perform under different conditions.

**What is Elastic Net Regression?**
- Elastic Net is a combination of Ridge and Lasso, controlled by two parameters:
    - $\lambda$ (alpha): Controls the overall penalty strength.
    - $\rho$ (l1_ratio): Controls the mix of the L1 and L2 penalties.
        - $\rho$ = 1: Pure Lasso
        - $\rho$ = 0: Pure Ridge
        - 0 < $\rho$ < 1: A mix of both Ridge and Lasso
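
The way the two parameters combine can be written out directly. The sketch below evaluates the penalty term in the form scikit-learn documents for `ElasticNet`, namely `alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2`. The function name and the example vector `w_demo` are made up for illustration.

```python
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    """Penalty term of the Elastic Net objective for coefficient vector w."""
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))       # Lasso part: sum of absolute values
    l2_sq = np.sum(w ** 2)       # Ridge part: sum of squares
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2_sq

w_demo = [1.0, -2.0, 0.0]  # hypothetical coefficients

for l1_ratio in (1.0, 0.5, 0.0):
    penalty = elastic_net_penalty(w_demo, alpha=0.1, l1_ratio=l1_ratio)
    print(f"l1_ratio={l1_ratio}: penalty={penalty:.3f}")
```

Setting `l1_ratio=1.0` leaves only the L1 term and `l1_ratio=0.0` leaves only the L2 term, which matches the pure Lasso and pure Ridge cases listed above.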

Advantages of Elastic Net:

✅ Works well when there are many correlated features (like Ridge).

✅ Performs automatic feature selection (like Lasso).

✅ Balances the trade-off between Ridge and Lasso.

Observations

📌 Ridge Regression
- Keeps all variables but shrinks them.
- Good for handling multicollinearity but does not perform feature selection.

📌 Lasso Regression
- Eliminates some variables (feature selection).
- May drop too many variables when predictors are correlated.

📌 Elastic Net Regression
- Balances Ridge and Lasso:
    - Shrinks some coefficients (like Ridge).
    - Sets others to zero (like Lasso).
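- Can achieve a lower RMSE than both Ridge and Lasso when predictors are correlated and only some are relevant, but this depends on the data and should be checked on the test set.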

When to Use Each Model?
- Ridge
    - All predictors contribute equally
    - Best when all variables matter and are correlated.
- Lasso
    - Many irrelevant features present
    - Best when some variables should be removed.
- Elastic Net
    - Features are correlated & some should be removed
    - Best balanced approach when features are correlated but feature selection is also needed.

##### To Compare Ridge, Lasso, and Elastic Net

In [None]:
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.metrics import mean_squared_error

# Define models
ridge = Ridge(alpha=1)
lasso = Lasso(alpha=0.1)
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # Mix of Ridge & Lasso

# Fit models
ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)
elastic_net.fit(X_train, y_train)

# Predictions
y_pred_ridge = ridge.predict(X_test)
y_pred_lasso = lasso.predict(X_test)
y_pred_elastic = elastic_net.predict(X_test)

# Compute RMSE
ridge_rmse = np.sqrt(mean_squared_error(y_test, y_pred_ridge))
lasso_rmse = np.sqrt(mean_squared_error(y_test, y_pred_lasso))
elastic_rmse = np.sqrt(mean_squared_error(y_test, y_pred_elastic))

# Print coefficients and RMSE
print("Ridge Regression Coefficients:", ridge.coef_)
print("Ridge Regression RMSE:", ridge_rmse)

print("Lasso Regression Coefficients:", lasso.coef_)
print("Lasso Regression RMSE:", lasso_rmse)

print("Elastic Net Regression Coefficients:", elastic_net.coef_)
print("Elastic Net Regression RMSE:", elastic_rmse)


##### Hyperparameter Tuning for Elastic Net using GridSearchCV
To get the best performance from Elastic Net, we need to tune its two main hyperparameters:
- alpha (λ): Controls the overall regularization strength.
- l1_ratio (ρ): Determines the mix of Lasso (L1) and Ridge (L2) penalties.

We will use GridSearchCV to find the best combination of these parameters.

Explanation of the Code
- ✅ Step 1: Define a grid of possible values for alpha (regularization) and l1_ratio (mix of Ridge & Lasso).
- ✅ Step 2: Use GridSearchCV to find the best combination using cross-validation (cv=5).
- ✅ Step 3: Train the model on X_train and identify the best parameters.
- ✅ Step 4: Use the best model to predict on X_test.
- ✅ Step 5: Compute RMSE to evaluate performance.

Output:
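- The best alpha and l1_ratio values are the ones with the lowest cross-validated MSE (equivalently, the lowest RMSE).
- The tuned model usually performs better than the default Elastic Net (alpha=1.0, l1_ratio=0.5), although an improvement on the test set is not guaranteed.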

When Should You Tune Hyperparameters?
- 📌 When the dataset has many features and we need to balance Ridge & Lasso.
- 📌 When we want to avoid overfitting (too little regularization) or underfitting (too much regularization).
- 📌 When we have high multicollinearity and some variables should be removed.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
import numpy as np

# Define the parameter grid
param_grid = {
    'alpha': [0.01, 0.1, 1, 10, 100],  # Regularization strength
    'l1_ratio': [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]  # Ratio of Lasso vs Ridge
}

# Define Elastic Net model
elastic_net = ElasticNet()

# Set up GridSearchCV
grid_search = GridSearchCV(elastic_net, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)

# Fit the model
grid_search.fit(X_train, y_train)

# Get the best parameters
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Predictions using the best model
y_pred = best_model.predict(X_test)

# Compute RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Print the results
print("Best Hyperparameters:", best_params)
print("Best Elastic Net RMSE:", rmse)


##### **When Lasso Performs Better Than Ridge Regression**
Lasso regression is expected to outperform Ridge regression when:

Only a Few Predictors Are Important (Sparsity Exists)
- If the true underlying model has only a few nonzero coefficients (i.e., most features are irrelevant), Lasso is better because it can shrink some coefficients to exactly zero, effectively performing feature selection.
- Example: Predicting house prices where only a few key variables (e.g., square footage, location) matter, while others (e.g., paint color) do not.

High Dimensionality (More Features Than Observations, p>n)
- When the number of features exceeds the number of observations, Lasso helps reduce dimensionality by selecting a subset of relevant features.
- Example: In genetics, where thousands of genes (p) are analyzed with limited patient samples (n).

Interpretable Models Are Needed
- Because Lasso sets some coefficients to zero, the final model contains only important predictors, making it more interpretable.

High Correlation Among Predictors (But Only One or Few Are Truly Relevant)
- If some predictors are highly correlated, Lasso will select only one and shrink the others to zero, simplifying the model.
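
The sparsity and p > n points above can be illustrated with a minimal sketch on synthetic data. Everything here is an illustrative assumption (the data-generating setup, the variable names, and the untuned alpha values); it is not output from the analysis in this notebook. Only 3 of 50 features carry signal, and there are fewer observations than features.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical sparse setting: n = 40 observations, p = 50 features,
# and only the first 3 features influence the response.
rng = np.random.default_rng(0)
n_obs, n_feat = 40, 50
X_sparse = rng.standard_normal((n_obs, n_feat))
true_beta = np.zeros(n_feat)
true_beta[:3] = [3.0, -2.0, 1.5]
y_sparse = X_sparse @ true_beta + 0.5 * rng.standard_normal(n_obs)

# alpha values are arbitrary illustrative choices, not tuned
lasso_sparse = Lasso(alpha=0.2).fit(X_sparse, y_sparse)
ridge_sparse = Ridge(alpha=1.0).fit(X_sparse, y_sparse)

# Count coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso_sparse.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_sparse.coef_ == 0))

print("Coefficients set exactly to zero by Lasso:", n_zero_lasso, "of", n_feat)
print("Coefficients set exactly to zero by Ridge:", n_zero_ridge, "of", n_feat)
print("Lasso estimates for the 3 signal features:", lasso_sparse.coef_[:3])
```

Lasso is expected to zero out most of the noise features here, while Ridge keeps every coefficient nonzero.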

##### **When Ridge Regression Performs Better Than Lasso**
Ridge regression is expected to perform better than Lasso when:

Many Small or Moderately Important Predictors Exist (No True Sparsity)
- If the true model has many predictors with nonzero coefficients, Ridge is preferred because it shrinks all coefficients continuously rather than eliminating some.
- Example: In macroeconomic forecasting, where multiple variables (e.g., inflation, interest rates, unemployment) contribute jointly to the outcome.

Multicollinearity Exists (Highly Correlated Predictors)
- Ridge handles multicollinearity better by shrinking correlated predictors rather than eliminating them.
- Unlike Lasso, which might arbitrarily drop some correlated predictors, Ridge retains them with smaller weights.

Better Prediction Accuracy Rather Than Feature Selection
- If the goal is prediction accuracy rather than interpretability, Ridge often performs better since it retains all features, preventing the loss of useful information.

Small Sample Size, Many Weak Predictors
- When there are many weak predictors (all contributing a little), Ridge works better because it preserves all features rather than eliminating some.
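
The multicollinearity point above can be sketched with two nearly identical predictors that both matter. The data, names, and alpha values below are illustrative assumptions. Ridge should split the weight roughly evenly between the two, whereas Lasso may concentrate it on one of them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: x2 is almost a copy of x1, and both truly matter.
rng = np.random.default_rng(1)
n_obs = 200
x1 = rng.standard_normal(n_obs)
x2 = x1 + 0.01 * rng.standard_normal(n_obs)  # nearly collinear with x1
X_corr = np.column_stack([x1, x2])
y_corr = 2.0 * x1 + 2.0 * x2 + 0.5 * rng.standard_normal(n_obs)

# alpha values are arbitrary illustrative choices, not tuned
ridge_corr = Ridge(alpha=1.0).fit(X_corr, y_corr)
lasso_corr = Lasso(alpha=0.1).fit(X_corr, y_corr)

corr_x1_x2 = float(np.corrcoef(x1, x2)[0, 1])
print("Correlation(x1, x2):", corr_x1_x2)
print("Ridge coefficients:", ridge_corr.coef_)  # weight shared between x1 and x2
print("Lasso coefficients:", lasso_corr.coef_)  # split between x1 and x2 can be uneven
```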

Conclusion
- Use Lasso when you expect sparsity (i.e., only a few important predictors).
- Use Ridge when all predictors contribute meaningfully, and multicollinearity is a concern.
- If unsure, Elastic Net (a mix of Ridge and Lasso) can be a good compromise.

_______________________

##### **Relationship Between Best Subset Selection, Lasso, and Ridge Regression**

Best Subset Selection
- Best subset selection finds the best subset of predictors by evaluating all possible combinations of predictors and choosing the one that minimizes a given criterion (e.g., least squares error).
- This approach is computationally expensive, especially when there are many predictors, because it requires evaluating $2^p$ possible subsets for $p$ predictors.
- It performs hard thresholding, meaning variables are either included or excluded, with no in-between shrinkage.
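
A minimal sketch of exhaustive best subset selection on a small synthetic problem follows. Note one assumption: subsets are ranked by a Gaussian BIC rather than raw least squares error, because the training error alone always favours the largest model. The data and the helper `bic_for_subset` are hypothetical.

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical small problem: p = 5 predictors, only the first two matter.
rng = np.random.default_rng(2)
n_obs, p = 100, 5
X_bs = rng.standard_normal((n_obs, p))
y_bs = 3.0 * X_bs[:, 0] - 2.0 * X_bs[:, 1] + 0.5 * rng.standard_normal(n_obs)

def bic_for_subset(cols):
    """Fit least squares on the chosen columns and return a Gaussian BIC."""
    model = LinearRegression().fit(X_bs[:, cols], y_bs)
    rss = np.sum((y_bs - model.predict(X_bs[:, cols])) ** 2)
    n_params = len(cols) + 1  # coefficients plus intercept
    return n_obs * np.log(rss / n_obs) + n_params * np.log(n_obs)

# Enumerate every non-empty subset of the p predictors
subsets = [
    list(cols)
    for k in range(1, p + 1)
    for cols in itertools.combinations(range(p), k)
]
scores = [bic_for_subset(cols) for cols in subsets]
best_subset = subsets[int(np.argmin(scores))]

print("Subsets evaluated:", len(subsets), "which equals 2**p - 1 =", 2 ** p - 1)
print("Best subset by BIC:", best_subset)
```

With p = 5 this is 31 fits (plus the intercept-only model); the count doubles with each additional predictor, which is why the method does not scale.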

**Lasso Regression vs. Best Subset Selection**

Similarities:
- Both perform variable selection, meaning they reduce the number of predictors in the final model.
- Lasso can be seen as a continuous relaxation of best subset selection since it automatically selects variables by shrinking some coefficients to exactly zero.

Differences:
- Computational Efficiency: Lasso is computationally more efficient than best subset selection, which becomes infeasible for large 𝑝.
- Bias-Variance Tradeoff: Lasso introduces some bias by shrinking coefficients, whereas best subset selection leads to models with lower bias but higher variance due to overfitting.
- Feature Selection Method: Lasso selects variables continuously (soft thresholding), while best subset selection is discrete (hard thresholding).
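
The soft versus hard thresholding distinction can be shown directly. In the special case of an orthonormal design, the Lasso solution is a soft-thresholded version of the least squares coefficients, while best subset selection corresponds to hard thresholding. The sketch below applies both operators to hypothetical coefficient values with a hypothetical threshold `t`.

```python
import numpy as np

def soft_threshold(beta, t):
    """Shrink each value toward zero by t; values within t of zero become zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - t, 0.0)

def hard_threshold(beta, t):
    """Keep values whose magnitude exceeds t unchanged; set the rest to zero."""
    return np.where(np.abs(beta) > t, beta, 0.0)

ols_beta = np.array([3.0, -0.5, 1.5, 0.2])  # hypothetical least squares estimates
t = 1.0                                     # hypothetical threshold

soft_result = soft_threshold(ols_beta, t)
hard_result = hard_threshold(ols_beta, t)

print("Least squares:           ", ols_beta)
print("Soft (Lasso-like):       ", soft_result)
print("Hard (best-subset-like): ", hard_result)
```

Surviving coefficients are shrunk under soft thresholding but left untouched under hard thresholding, which is the source of the extra bias in Lasso.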

**Ridge Regression vs. Best Subset Selection**

Similarities:
- Both aim to improve prediction accuracy compared to standard least squares regression by reducing overfitting.

Differences:
- Feature Selection: Ridge regression does not perform variable selection—it shrinks coefficients continuously but never sets them to exactly zero. In contrast, best subset selection results in a sparse model.
- Handling Multicollinearity: Ridge is better suited for multicollinearity since it distributes the coefficient values among correlated predictors rather than selecting only one, as best subset selection does.

Key Takeaways
- Best subset selection is ideal when p is small but computationally expensive when p is large.
- Lasso is a computationally feasible alternative that performs continuous feature selection.
- Ridge does not select variables but is better suited for situations where all predictors contribute meaningfully to the response.
- Elastic Net (a hybrid of Ridge and Lasso) can combine the benefits of both, balancing selection and shrinkage.

__________________

##### **Rule Governing the Selection of the Tuning Parameter for Lasso and Ridge Regression**

The tuning parameter (λ) in Lasso and Ridge regression controls the strength of regularization and is selected to optimize the bias-variance tradeoff. 
- The rule for selecting λ follows these principles:

**Selecting λ Using Cross-Validation**
- The most common approach for tuning λ is k-fold cross-validation:
    - Split the data into k subsets (folds).
    - Train the model on 𝑘−1 folds and validate on the remaining fold.
    - Repeat the process for different values of λ and choose the one that minimizes the average validation error (typically Mean Squared Error, for both Ridge and Lasso).
    - The optimal λ balances model complexity and predictive accuracy.
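
The k-fold steps above can be written out by hand. This is a minimal sketch for Ridge on synthetic data; the data, the λ grid, and the variable names are illustrative assumptions. In practice `GridSearchCV` or `RidgeCV` performs the same loop.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Hypothetical data: 3 signal features and 7 noise features.
rng = np.random.default_rng(3)
X_cv = rng.standard_normal((120, 10))
y_cv = X_cv[:, :3] @ np.array([3.0, -2.0, 1.0]) + 0.5 * rng.standard_normal(120)

candidate_lambdas = np.logspace(-3, 3, 7)  # illustrative grid of λ values
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

mean_cv_mse = []
for lam in candidate_lambdas:
    fold_mse = []
    for train_idx, val_idx in kfold.split(X_cv):
        # Train on k-1 folds, validate on the held-out fold
        model = Ridge(alpha=lam).fit(X_cv[train_idx], y_cv[train_idx])
        preds = model.predict(X_cv[val_idx])
        fold_mse.append(mean_squared_error(y_cv[val_idx], preds))
    mean_cv_mse.append(float(np.mean(fold_mse)))

# Pick the λ with the lowest average validation error
best_lambda = candidate_lambdas[int(np.argmin(mean_cv_mse))]
print("Mean CV MSE per λ:", np.round(mean_cv_mse, 4))
print("Selected λ:", best_lambda)
```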

**Effect of λ in Lasso and Ridge Regression**
- (a) Ridge Regression (𝐿2 Regularization)
    - Small 𝜆 → Similar to Ordinary Least Squares (OLS)
        - Minimal shrinkage, model may overfit.
    - Large 𝜆 → Stronger Shrinkage
        - Coefficients shrink but remain nonzero.
        - Reduces multicollinearity and prevents overfitting but may introduce bias.
    - Optimal λ: Typically selected to minimize cross-validated prediction error.

- (b) Lasso Regression (𝐿1 Regularization)
    - Small 𝜆 → Model Similar to OLS
        - Minimal shrinkage, overfitting risk.
    - Moderate 𝜆 → Some Coefficients Shrink to Zero
        - Acts as feature selection.
    - Large 𝜆 → More Coefficients Become Zero
        - Can underfit if too large, as it removes too many variables.
    - Optimal λ Selection: Chosen using cross-validation to balance sparsity and predictive performance.
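
The effect of λ described above can be checked numerically. The sketch below fits both models over a few illustrative λ values on synthetic data (all names and values are assumptions). Note that scikit-learn scales the two objectives differently, so the same `alpha` is not directly comparable between `Ridge` and `Lasso`; only the trend within each model matters here.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: 3 signal features followed by 7 noise features.
rng = np.random.default_rng(4)
X_path = rng.standard_normal((100, 10))
y_path = X_path[:, :3] @ np.array([3.0, -2.0, 1.0]) + 0.5 * rng.standard_normal(100)

lambdas = [0.001, 0.1, 1.0, 100.0]  # illustrative values from weak to strong
ridge_norms, lasso_nonzero = [], []
for lam in lambdas:
    ridge_fit = Ridge(alpha=lam).fit(X_path, y_path)
    lasso_fit = Lasso(alpha=lam, max_iter=10_000).fit(X_path, y_path)
    # Ridge: track overall coefficient size; Lasso: track how many survive
    ridge_norms.append(float(np.linalg.norm(ridge_fit.coef_)))
    lasso_nonzero.append(int(np.sum(lasso_fit.coef_ != 0)))
    print(f"λ={lam:>7}: Ridge ||coef||={ridge_norms[-1]:.3f}, "
          f"Lasso nonzero coefficients={lasso_nonzero[-1]}")
```

Ridge coefficients keep shrinking but stay nonzero, while Lasso removes more coefficients as λ grows until none remain.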

**Common Methods for Selecting λ**
- Cross-validation (CV): Most widely used method.
- Grid Search: Trying different λ values systematically.
- Generalized Cross-Validation (GCV): An approximation of leave-one-out CV, often used in Ridge regression.
- Information Criteria (AIC, BIC): Used in some cases to select 𝜆 based on model fit and complexity.
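
scikit-learn ships estimators that implement several of these selection methods. The sketch below is a rough illustration on synthetic data (the data and grid are assumptions): `RidgeCV` with its default `cv=None` is documented as using an efficient leave-one-out scheme, which is closely related to GCV; `LassoCV` uses k-fold cross-validation; `LassoLarsIC` selects λ by AIC or BIC.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LassoLarsIC, RidgeCV

# Hypothetical data: 3 signal features and 7 noise features.
rng = np.random.default_rng(5)
X_sel = rng.standard_normal((100, 10))
y_sel = X_sel[:, :3] @ np.array([3.0, -2.0, 1.0]) + 0.5 * rng.standard_normal(100)

alpha_grid = np.logspace(-3, 3, 13)  # illustrative grid

# Efficient leave-one-out cross-validation over the grid (default cv=None)
ridge_loo = RidgeCV(alphas=alpha_grid).fit(X_sel, y_sel)

# 5-fold cross-validation over the same grid
lasso_kfold = LassoCV(alphas=alpha_grid, cv=5).fit(X_sel, y_sel)

# Information criterion (BIC) evaluated along the Lasso path
lasso_bic = LassoLarsIC(criterion="bic").fit(X_sel, y_sel)

print("RidgeCV (leave-one-out) alpha:", ridge_loo.alpha_)
print("LassoCV (5-fold) alpha:       ", lasso_kfold.alpha_)
print("LassoLarsIC (BIC) alpha:      ", lasso_bic.alpha_)
```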

______________

##### **Expected Values of Coefficients for Signal and Noise Variables in a Robust Regression Model**
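In a regularized or robust regression model (such as Ridge, Lasso, Elastic Net, or Huber regression), the behavior of estimated coefficients for signal (true predictors) and noise (irrelevant predictors) depends on the penalty or loss function and the tuning parameter 𝜆. Strictly speaking, Ridge, Lasso, and Elastic Net are regularized models, while Huber regression is robust in the sense of resisting outliers.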

Definitions
- Signal Variables: Variables that have a real effect on the response variable (true predictors). Their coefficients should ideally be close to their true values.
- Noise Variables: Variables that have no true relationship with the response. Their coefficients should ideally be shrunk toward zero to avoid overfitting.

**Expected Behavior of Coefficients in Robust Regression**
- (a) Ridge Regression (𝐿2 Regularization)
    - Shrinks all coefficients proportionally but never sets them exactly to zero.
    - Expected Values of Coefficients:
        - Signal Variables: Shrunk toward zero, but retain most of their information.
        - Noise Variables: Also shrunk toward zero, but not completely eliminated.
    - Key Effect:
        - Ridge keeps small contributions from all variables, making it less interpretable but good for multicollinearity.

- (b) Lasso Regression (𝐿1 Regularization)
    - Performs feature selection by setting some coefficients exactly to zero.
    - Expected Values of Coefficients:
        - Signal Variables: Estimated with some bias (due to shrinkage), but often closer to their true values than in Ridge.
        - Noise Variables: Many coefficients are exactly zero, effectively removing unimportant predictors.
    - Key Effect:
        - Lasso provides a sparse model and improves interpretability by selecting only relevant features.

- (c) Elastic Net (Combination of 𝐿1 and 𝐿2 Regularization)
    - Balances Ridge and Lasso properties.
    - Expected Values of Coefficients:  
        - Signal Variables: Shrunk moderately (like Ridge) but some may be eliminated (like Lasso).
        - Noise Variables: Some eliminated, others weakly shrunk.
    - Key Effect:
        - Works well when features are correlated, handling feature selection and regularization simultaneously.

- (d) Huber Regression (Robust to Outliers)
    - Reduces the impact of extreme outliers using a loss function that switches from squared loss to absolute loss for large residuals.
    - Expected Values of Coefficients:
        - Signal Variables: Less affected by outliers, resulting in a more stable estimate.
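        - Noise Variables: Not shrunk or removed by the Huber loss itself; the loss downweights outlying observations rather than irrelevant predictors, so noise coefficients are simply estimated more stably.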
    - Key Effect:
        - Provides robustness against outliers while maintaining good prediction performance.

| Model Type |Signal Variables | Noise Variables |
|---------------|--------|----------------------------------| 
|Ridge Regression|Shrunk but nonzero |Shrunk but nonzero|
|Lasso Regression|Shrunk, some zeroed |Many exactly zero |
|Elastic Net |Shrunk, some zeroed|Some zeroed, some shrunk |
|Huber Regression |Stable against outliers |Less influenced by extreme values |

Takeaway
- Ridge keeps all predictors but shrinks them.
- Lasso eliminates noise variables by setting some coefficients to exactly zero.
- Elastic Net combines both Ridge and Lasso properties, handling correlated predictors.
- Huber regression provides robustness against outliers, reducing noise impact.
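
The Huber loss mentioned above is quadratic for small residuals and linear for large ones. A minimal sketch of the textbook form follows; the threshold name `delta` is an assumption for illustration (scikit-learn's `HuberRegressor` exposes a related parameter called `epsilon` and also estimates a scale).

```python
import numpy as np

def squared_loss(residual):
    """Ordinary squared loss, scaled by 1/2 to match the Huber convention."""
    return 0.5 * residual ** 2

def huber_loss(residual, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta,
                    0.5 * residual ** 2,
                    delta * (abs_r - 0.5 * delta))

residuals = np.array([0.5, 1.0, 3.0, 10.0])  # hypothetical residuals
sq_values = squared_loss(residuals)
huber_values = huber_loss(residuals, delta=1.0)

for r, s, h in zip(residuals, sq_values, huber_values):
    print(f"residual={r:5.1f}  squared={s:6.3f}  huber={h:6.3f}")
```

The two losses agree for small residuals, but the Huber loss grows only linearly for large ones, which is what limits the pull of outliers on the fit.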

##### Step-by-step Python example demonstrating how Ridge, Lasso, Elastic Net, and Huber regression handle signal and noise variables.
- Generate synthetic data with true signal and noise variables.
- Apply Ridge, Lasso, Elastic Net, and Huber regression.
- Compare how each method estimates coefficients for signal and noise variables.

Key Observations from the Output

Ridge Regression:
- Shrinks all coefficients but never sets them exactly to zero.
- Retains small values for noise variables.

Lasso Regression:
- Many coefficients are exactly zero, removing irrelevant features.
- Useful for feature selection.

Elastic Net:
- Balances Ridge and Lasso, keeping some variables while shrinking others.

Huber Regression:
- More stable against outliers, coefficients remain relatively close to the true values.

Model Performance (MSE):
- Ridge and Elastic Net usually perform better with highly correlated features.
- Lasso performs better when there are truly sparse models (few relevant predictors).
- Huber is useful when the dataset contains outliers.

Final Thoughts
- If the goal is prediction (not feature selection), Ridge or Elastic Net work well.
- If the goal is feature selection, Lasso is the best choice.
- For robustness against outliers, Huber regression is ideal.
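
The synthetic data in the main example contains no outliers, so the benefit of Huber regression is hard to see there. A separate minimal sketch, assuming a single predictor with a true slope of 2 and ten deliberately contaminated observations (all values are illustrative), compares it with ordinary least squares:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Hypothetical data: y = 1 + 2x + noise
rng = np.random.default_rng(6)
n_obs = 100
x_out = rng.uniform(0, 10, size=n_obs)
y_out = 1.0 + 2.0 * x_out + 0.5 * rng.standard_normal(n_obs)

# Contaminate the 10 observations with the largest x by adding a large offset
outlier_idx = np.argsort(x_out)[-10:]
y_out[outlier_idx] += 50.0

X_out = x_out.reshape(-1, 1)
ols_fit = LinearRegression().fit(X_out, y_out)
huber_fit = HuberRegressor().fit(X_out, y_out)  # default epsilon=1.35

ols_slope = float(ols_fit.coef_[0])
huber_slope = float(huber_fit.coef_[0])
print("True slope:  2.0")
print("OLS slope:  ", round(ols_slope, 3))
print("Huber slope:", round(huber_slope, 3))
```

The least squares slope is pulled toward the contaminated points, while the Huber estimate should stay much closer to the true value.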

In [None]:
# Step 1: Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso, ElasticNet, HuberRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

# Set random seed for reproducibility
np.random.seed(42)

# Step 2: Generate Data with Signal & Noise Variables
# Generate 100 samples and 10 features
n_samples, n_features = 100, 10

# Create signal variables (true predictors)
X_signal = np.random.randn(n_samples, 3)  # 3 true predictors
beta_signal = np.array([3, -2, 1])  # True coefficients

# Create noise variables (irrelevant predictors)
X_noise = np.random.randn(n_samples, n_features - 3)  # 7 noise variables

# Combine signal and noise variables
X = np.hstack([X_signal, X_noise])

# Generate response variable with some noise
y = X_signal @ beta_signal + np.random.randn(n_samples) * 0.5  # Add some noise

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize predictors (important for regularization methods)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 3: Train Ridge, Lasso, Elastic Net, and Huber Models
# Define models
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1)
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)  # 50% Lasso, 50% Ridge
huber = HuberRegressor(alpha=0.1)

# Train models
ridge.fit(X_train, y_train)
lasso.fit(X_train, y_train)
elastic_net.fit(X_train, y_train)
huber.fit(X_train, y_train)

# Get coefficients
ridge_coefs = ridge.coef_
lasso_coefs = lasso.coef_
elastic_net_coefs = elastic_net.coef_
huber_coefs = huber.coef_

# Step 4: Compare Coefficients for Signal and Noise Variables
# Create a dataframe to compare coefficient estimates
coef_df = pd.DataFrame({
    'Feature': [f'X{i+1}' for i in range(n_features)],
    'Ridge': ridge_coefs,
    'Lasso': lasso_coefs,
    'Elastic Net': elastic_net_coefs,
    'Huber': huber_coefs
})

# Sort by absolute value of true signal importance
coef_df['True Importance'] = np.concatenate((beta_signal, np.zeros(n_features - 3)))
coef_df = coef_df.sort_values(by='True Importance', ascending=False, key=abs)

print(coef_df)

# Step 5: Evaluate Model Performance
# Make predictions
ridge_pred = ridge.predict(X_test)
lasso_pred = lasso.predict(X_test)
elastic_net_pred = elastic_net.predict(X_test)
huber_pred = huber.predict(X_test)

# Compute Mean Squared Error
ridge_mse = mean_squared_error(y_test, ridge_pred)
lasso_mse = mean_squared_error(y_test, lasso_pred)
elastic_net_mse = mean_squared_error(y_test, elastic_net_pred)
huber_mse = mean_squared_error(y_test, huber_pred)

# Display results
print("\nModel Performance (MSE):")
print(f"Ridge MSE: {ridge_mse:.4f}")
print(f"Lasso MSE: {lasso_mse:.4f}")
print(f"Elastic Net MSE: {elastic_net_mse:.4f}")
print(f"Huber MSE: {huber_mse:.4f}")

# Step 6: Visualizing Coefficient Shrinkage
import seaborn as sns

# Plot coefficient values for each model
plt.figure(figsize=(12, 6))
coef_df_melted = coef_df.melt(id_vars=['Feature', 'True Importance'], var_name='Model', value_name='Coefficient')

sns.barplot(data=coef_df_melted, x='Feature', y='Coefficient', hue='Model', palette='coolwarm')
plt.axhline(0, color='black', linestyle='--', linewidth=1)
plt.xticks(rotation=45)
plt.title("Coefficient Estimates Across Models")
plt.show()


##### Refine the analysis by tuning the hyperparameters for Ridge, Lasso, Elastic Net, and Huber regression using cross-validation. 
- This will help us find the best values for alpha (regularization strength) and l1_ratio (for Elastic Net).

Key Improvements in This Version
- Uses Cross-Validation: Ensures the best hyperparameter selection based on performance across multiple data splits.
- Finds Optimal Alpha & L1 Ratio (Elastic Net): Reduces overfitting and improves generalization.
- Final Model Training with Optimal Parameters: Ensures the best version of each model is used for predictions.

In [None]:
# Step 1: Import Required Libraries
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Step 2: Define the Hyperparameter Grid for Each Model
# Define hyperparameter grids
ridge_params = {'alpha': np.logspace(-3, 3, 10)}  # 10 values from 0.001 to 1000
lasso_params = {'alpha': np.logspace(-3, 3, 10)}
elastic_net_params = {
    'alpha': np.logspace(-3, 3, 10),
    'l1_ratio': [0.1, 0.5, 0.9]  # Mix between Lasso and Ridge
}
huber_params = {'alpha': np.logspace(-3, 3, 10)}

# Define models
ridge = Ridge()
lasso = Lasso()
elastic_net = ElasticNet()
huber = HuberRegressor()

# Step 3: Perform Grid Search with Cross-Validation
# Perform Grid Search
ridge_cv = GridSearchCV(ridge, ridge_params, cv=5, scoring='neg_mean_squared_error')
lasso_cv = GridSearchCV(lasso, lasso_params, cv=5, scoring='neg_mean_squared_error')
elastic_net_cv = GridSearchCV(elastic_net, elastic_net_params, cv=5, scoring='neg_mean_squared_error')
huber_cv = GridSearchCV(huber, huber_params, cv=5, scoring='neg_mean_squared_error')

# Fit models
ridge_cv.fit(X_train, y_train)
lasso_cv.fit(X_train, y_train)
elastic_net_cv.fit(X_train, y_train)
huber_cv.fit(X_train, y_train)

# Step 4: Get the Best Hyperparameters
print("Best Hyperparameters:")
print(f"Ridge Best Alpha: {ridge_cv.best_params_['alpha']}")
print(f"Lasso Best Alpha: {lasso_cv.best_params_['alpha']}")
print(f"Elastic Net Best Alpha: {elastic_net_cv.best_params_['alpha']}, Best L1 Ratio: {elastic_net_cv.best_params_['l1_ratio']}")
print(f"Huber Best Alpha: {huber_cv.best_params_['alpha']}")

# Step 5: Train Final Models with Optimal Parameters
# Train models with best parameters
ridge_best = Ridge(alpha=ridge_cv.best_params_['alpha']).fit(X_train, y_train)
lasso_best = Lasso(alpha=lasso_cv.best_params_['alpha']).fit(X_train, y_train)
elastic_net_best = ElasticNet(alpha=elastic_net_cv.best_params_['alpha'], l1_ratio=elastic_net_cv.best_params_['l1_ratio']).fit(X_train, y_train)
huber_best = HuberRegressor(alpha=huber_cv.best_params_['alpha']).fit(X_train, y_train)

# Step 6: Evaluate Performance
# Make predictions
ridge_pred = ridge_best.predict(X_test)
lasso_pred = lasso_best.predict(X_test)
elastic_net_pred = elastic_net_best.predict(X_test)
huber_pred = huber_best.predict(X_test)

# Compute Mean Squared Error
ridge_mse = mean_squared_error(y_test, ridge_pred)
lasso_mse = mean_squared_error(y_test, lasso_pred)
elastic_net_mse = mean_squared_error(y_test, elastic_net_pred)
huber_mse = mean_squared_error(y_test, huber_pred)

# Display results
print("\nOptimized Model Performance (MSE):")
print(f"Ridge MSE: {ridge_mse:.4f}")
print(f"Lasso MSE: {lasso_mse:.4f}")
print(f"Elastic Net MSE: {elastic_net_mse:.4f}")
print(f"Huber MSE: {huber_mse:.4f}")

##### Feature Importance Analysis and Learning Curves for Ridge, Lasso, Elastic Net, and Huber Regression
Extend our analysis by:
- ✅ Identifying important features (especially for Lasso and Elastic Net)
- ✅ Visualizing learning curves to check model performance as the training set size grows

Insights from the Visualizations

Feature Importance
- Lasso and Elastic Net shrink many coefficients to zero (feature selection).
- Ridge shrinks coefficients but does not eliminate any variables.

Learning Curves
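- If training and validation errors converge to a similarly low value, the model generalizes well.
- If there's a large gap between training and validation error, the model is likely overfitting; if both errors are high and close together, it is likely underfitting.
- Lasso and Elastic Net may need fewer samples to stabilize when many features are irrelevant, because feature selection reduces the effective model size.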

In [None]:
# Step 1: Import Additional Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Step 2: Feature Importance (Lasso and Elastic Net)
# Lasso and Elastic Net perform feature selection by shrinking some coefficients to zero. Let's visualize the most important features.
# Extract feature importance (coefficients)
lasso_coeffs = lasso_best.coef_
elastic_net_coeffs = elastic_net_best.coef_
ridge_coeffs = ridge_best.coef_

# Create a feature importance DataFrame
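# X_train is a NumPy array after scaling, so build the feature names explicitly
feature_names = [f'X{i+1}' for i in range(X_train.shape[1])]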
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Lasso': lasso_coeffs,
    'Elastic Net': elastic_net_coeffs,
    'Ridge': ridge_coeffs
})

# Sort by the magnitude of the Lasso coefficient (large negative values are important too)
importance_df = importance_df.sort_values(by="Lasso", ascending=False, key=abs)

# Plot feature importance
plt.figure(figsize=(12, 6))
sns.barplot(x='Feature', y='Lasso', data=importance_df, color='blue', label="Lasso")
sns.barplot(x='Feature', y='Elastic Net', data=importance_df, color='orange', alpha=0.7, label="Elastic Net")
plt.xticks(rotation=45, ha="right")
plt.title("Feature Importance: Lasso vs Elastic Net")
plt.legend()
plt.show()

# Step 3: Learning Curve Visualization
# A learning curve shows how the model performs as we get more training data.
from sklearn.model_selection import learning_curve

def plot_learning_curve(model, X, y, title):
    train_sizes, train_scores, test_scores = learning_curve(
        model, X, y, cv=5, scoring="neg_mean_squared_error", train_sizes=np.linspace(0.1, 1.0, 10)
    )

    train_mean = -np.mean(train_scores, axis=1)
    test_mean = -np.mean(test_scores, axis=1)

    plt.figure(figsize=(8, 5))
    plt.plot(train_sizes, train_mean, 'o-', label="Training Error", color="red")
    plt.plot(train_sizes, test_mean, 'o-', label="Validation Error", color="blue")
    plt.xlabel("Training Set Size")
    plt.ylabel("Mean Squared Error")
    plt.title(title)
    plt.legend()
    plt.show()

# Plot learning curves
plot_learning_curve(ridge_best, X_train, y_train, "Learning Curve - Ridge Regression")
plot_learning_curve(lasso_best, X_train, y_train, "Learning Curve - Lasso Regression")
plot_learning_curve(elastic_net_best, X_train, y_train, "Learning Curve - Elastic Net Regression")
plot_learning_curve(huber_best, X_train, y_train, "Learning Curve - Huber Regression")

##### Comparing Ridge, Lasso, Elastic Net, and Huber Regression on Multiple Metrics

Compare the models using different performance metrics:
- ✅ Mean Squared Error (MSE) – Measures overall error, penalizing large deviations.
- ✅ Root Mean Squared Error (RMSE) – Helps interpret error in original units.
- ✅ Mean Absolute Error (MAE) – Less sensitive to large outliers than MSE.
- ✅ R² Score – Explains how much variance in the target is captured by the model.
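
To make the four definitions concrete, they can be computed by hand with NumPy. The values below are a small hypothetical example, and the `_manual` names are chosen only to avoid clashing with variables elsewhere in the notebook.

```python
import numpy as np

y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])  # hypothetical observed values
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])   # hypothetical predictions

errors = y_true_demo - y_pred_demo

mse_manual = float(np.mean(errors ** 2))      # average squared error
rmse_manual = float(np.sqrt(mse_manual))      # same units as the target
mae_manual = float(np.mean(np.abs(errors)))   # average absolute error

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = float(1.0 - ss_res / ss_tot)

print(f"MSE={mse_manual:.4f}  RMSE={rmse_manual:.4f}  "
      f"MAE={mae_manual:.4f}  R²={r2_manual:.4f}")
```

The single error of 1.0 accounts for two thirds of the squared error but only half of the absolute error, which shows why MAE is less sensitive to large deviations.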

Insights from the Comparison

Ridge vs. Lasso vs. Elastic Net vs. Huber
- Ridge Regression performs well when all predictors matter (it accepts a little bias in exchange for lower variance).
- Lasso Regression is useful for feature selection (eliminates unimportant variables).
- Elastic Net balances Ridge and Lasso, working well with correlated features.
- Huber Regression is robust to outliers, making it ideal when data has extreme values.

Which Model to Choose?
- If many features are irrelevant → Lasso or Elastic Net is better.
- If all features contribute to the outcome → Ridge performs best.
- If outliers are present → Huber is the best choice.

In [None]:
# Step 1: Define Performance Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Define function to calculate all metrics
def evaluate_model(model, X_test, y_test, model_name):
    y_pred = model.predict(X_test)
    
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    return {
        "Model": model_name,
        "MSE": round(mse, 4),
        "RMSE": round(rmse, 4),
        "MAE": round(mae, 4),
        "R² Score": round(r2, 4)
    }

# Evaluate all models
ridge_results = evaluate_model(ridge_best, X_test, y_test, "Ridge Regression")
lasso_results = evaluate_model(lasso_best, X_test, y_test, "Lasso Regression")
elastic_net_results = evaluate_model(elastic_net_best, X_test, y_test, "Elastic Net Regression")
huber_results = evaluate_model(huber_best, X_test, y_test, "Huber Regression")

# Create DataFrame for comparison
import pandas as pd
results_df = pd.DataFrame([ridge_results, lasso_results, elastic_net_results, huber_results])
print(results_df)

# Step 2: Visualizing Model Performance
# Plot bar charts for metric comparison
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
metrics = ["MSE", "RMSE", "MAE", "R² Score"]
colors = ["blue", "orange", "green", "red"]
models = ["Ridge Regression", "Lasso Regression", "Elastic Net Regression", "Huber Regression"]

for i, metric in enumerate(metrics):
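    ax = axes[i//2, i%2]
    # Draw the bars with matplotlib directly, one color per model
    ax.bar(models, results_df[metric], color=colors)
    ax.set_title(f"Comparison of {metric}")
    # Rotate the existing tick labels instead of resetting them
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right")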

plt.tight_layout()
plt.show()

##### Hyperparameter Tuning for Ridge, Lasso, Elastic Net, and Huber Regression
Now, let’s optimize the hyperparameters for each model using Grid Search Cross-Validation (GridSearchCV) and Randomized Search for efficiency.

Step 1: Define Hyperparameter Grids

Each model has key parameters:
- Ridge Regression: alpha (penalty strength)
- Lasso Regression: alpha (penalty strength)
- Elastic Net: alpha (penalty strength), l1_ratio (balance between Lasso and Ridge)
- Huber Regression: epsilon (controls outlier sensitivity)

In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import Ridge, Lasso, ElasticNet, HuberRegressor
import numpy as np

# Define hyperparameter grids
ridge_params = {"alpha": np.logspace(-3, 3, 10)}  # Alpha from 0.001 to 1000
lasso_params = {"alpha": np.logspace(-3, 3, 10)}
elastic_net_params = {
    "alpha": np.logspace(-3, 3, 10),
    "l1_ratio": np.linspace(0.1, 1, 10)  # 1.0 = pure Lasso; 0.1 = mostly Ridge
}
huber_params = {"epsilon": np.linspace(1.1, 2.0, 10)}  # Controls outlier sensitivity

Step 2: Perform Hyperparameter Tuning

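We will use GridSearchCV for Ridge, Lasso, and Elastic Net, and RandomizedSearchCV for Huber (since it is slower to fit). Note that with n_iter=10 and only 10 candidate epsilon values, the randomized search evaluates every candidate and so behaves like a grid search; it saves time only when the search space is larger than n_iter.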
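
Randomized search pays off mainly when the search space is continuous or much larger than the evaluation budget. A minimal sketch, assuming synthetic data and illustrative ranges (the `loguniform` and `uniform` ranges and `n_iter=20` are assumptions, not values from this analysis):

```python
import numpy as np
from scipy.stats import loguniform, uniform
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical data: 3 signal features and 7 noise features.
rng = np.random.default_rng(7)
X_rs = rng.standard_normal((100, 10))
y_rs = X_rs[:, :3] @ np.array([3.0, -2.0, 1.0]) + 0.5 * rng.standard_normal(100)

# Continuous distributions instead of fixed lists of values
param_distributions = {
    "alpha": loguniform(1e-3, 1e1),  # sampled on a log scale between 0.001 and 10
    "l1_ratio": uniform(0.1, 0.9),   # uniform on [0.1, 1.0] (loc=0.1, scale=0.9)
}

random_search = RandomizedSearchCV(
    ElasticNet(max_iter=10_000),
    param_distributions,
    n_iter=20,  # only 20 sampled combinations are evaluated
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=0,
)
random_search.fit(X_rs, y_rs)

best_random_params = random_search.best_params_
print("Best sampled parameters:", best_random_params)
print("Best CV MSE:", -random_search.best_score_)
```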

In [None]:
# Ridge Regression - Grid Search
ridge_cv = GridSearchCV(Ridge(), ridge_params, scoring="neg_mean_squared_error", cv=5)
ridge_cv.fit(X_train, y_train)
best_ridge = ridge_cv.best_estimator_

# Lasso Regression - Grid Search
lasso_cv = GridSearchCV(Lasso(), lasso_params, scoring="neg_mean_squared_error", cv=5)
lasso_cv.fit(X_train, y_train)
best_lasso = lasso_cv.best_estimator_

# Elastic Net - Grid Search
elastic_net_cv = GridSearchCV(ElasticNet(), elastic_net_params, scoring="neg_mean_squared_error", cv=5)
elastic_net_cv.fit(X_train, y_train)
best_elastic_net = elastic_net_cv.best_estimator_

# Huber Regression - Randomized Search (efficient for larger search space)
huber_cv = RandomizedSearchCV(HuberRegressor(), huber_params, scoring="neg_mean_squared_error", cv=5, n_iter=10, random_state=42)
huber_cv.fit(X_train, y_train)
best_huber = huber_cv.best_estimator_

# Step 3: Display Best Hyperparameters
print(f"Best Ridge Alpha: {ridge_cv.best_params_['alpha']}")
print(f"Best Lasso Alpha: {lasso_cv.best_params_['alpha']}")
print(f"Best Elastic Net Alpha: {elastic_net_cv.best_params_['alpha']}, L1 Ratio: {elastic_net_cv.best_params_['l1_ratio']}")
print(f"Best Huber Epsilon: {huber_cv.best_params_['epsilon']}")

# Step 4: Evaluate Optimized Models
ridge_results = evaluate_model(best_ridge, X_test, y_test, "Optimized Ridge")
lasso_results = evaluate_model(best_lasso, X_test, y_test, "Optimized Lasso")
elastic_net_results = evaluate_model(best_elastic_net, X_test, y_test, "Optimized Elastic Net")
huber_results = evaluate_model(best_huber, X_test, y_test, "Optimized Huber")

results_df = pd.DataFrame([ridge_results, lasso_results, elastic_net_results, huber_results])
print(results_df)


Insights from Hyperparameter Tuning
- Ridge and Lasso models benefit from optimal alpha values, improving performance.
- Elastic Net finds the best mix between Lasso (L1) and Ridge (L2) penalties.
- Huber Regression adjusts epsilon, improving robustness to outliers.

# 4. Generalized Linear Models (GLMs)

### What are Generalized Linear Models (GLMs)?
Generalized Linear Models (GLMs) extend linear regression by allowing the response variable (outcome) to follow a distribution from the exponential family (e.g., Normal, Poisson, Binomial, Gamma) rather than assuming normality.

Standard linear regression
- Models the relationship between predictors and the outcome assuming constant variance and normally distributed errors.

GLMs introduce:
- A Response Distribution: The outcome variable can follow a variety of distributions.
- A Link Function: Transforms the expected value of the response variable to a scale where a linear relationship with predictors holds.

What It Means: 
- GLMs extend linear regression to other data distributions (e.g., Poisson for count data).
- A GLM models the mean of the outcome variable through a link function.

Outcome Interpretation: 
- The coefficients explain how each predictor affects the mean outcome, given the distribution.

Performance Measures:
- Deviance: Measures how well the model fits compared to a saturated model (one that fits the data perfectly); lower values are better.

Lay Explanation: 
- GLMs are like flexible versions of linear regression that can handle different data types (like counts or binary data), giving predictions that respect the data’s nature.

Use Case: 
- Extends linear regression for non-normal distributions (e.g., Poisson regression for count data).

Model Types: 
- Poisson regression, 
- Binomial regression.
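
As a rough illustration of the coefficient interpretation above, the sketch below fits a Poisson regression to synthetic counts. It uses scikit-learn's `PoissonRegressor` for continuity with the earlier cells, which is a choice made for this sketch; the data and the true parameter values are assumptions.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Hypothetical count data generated with a log link
rng = np.random.default_rng(8)
n_obs = 2000
x_count = rng.standard_normal(n_obs)
true_intercept, true_slope = 0.3, 0.5
mu = np.exp(true_intercept + true_slope * x_count)  # log(mu) is linear in x
y_count = rng.poisson(mu)

# alpha=0 turns off the L2 penalty that PoissonRegressor applies by default
poisson_fit = PoissonRegressor(alpha=0.0).fit(x_count.reshape(-1, 1), y_count)

est_slope = float(poisson_fit.coef_[0])
rate_ratio = float(np.exp(est_slope))
print("Estimated slope on the log scale:", round(est_slope, 3))
print("Multiplicative effect on the mean per unit of x:", round(rate_ratio, 3))
```

With a log link, a one-unit increase in the predictor multiplies the expected count by exp(coefficient) rather than adding a fixed amount.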

### GLM Components
A GLM consists of three main components:
- Random Component – Specifies the probability distribution of the response variable, e.g.: 
    - Normal, 
    - Poisson, 
    - Binomial.
- Systematic Component – A linear predictor $η = β_0 + β_1 X_1 + β_2 X_2 + \dots + β_p X_p$
- Link Function – Connects the expected value of the response variable (𝐸(𝑌)) to the linear predictor:
$$ g(E(Y)) = η$$
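
The role of the link function can be sketched numerically: the inverse link maps the linear predictor $η$ back to the mean scale. The values of `eta` below are arbitrary illustrations, and the dictionary of inverse links is just a convenient layout for this sketch.

```python
import numpy as np

# Linear predictor for three hypothetical observations (illustrative values)
eta = np.array([-2.0, 0.0, 2.0])

# Inverse links map the linear predictor back to the mean of the response
inverse_links = {
    "identity": lambda e: e,                      # Normal: mu = eta
    "log": lambda e: np.exp(e),                   # Poisson: mu = exp(eta), always positive
    "logit": lambda e: 1.0 / (1.0 + np.exp(-e)),  # Binomial: mu stays between 0 and 1
}

means = {name: inverse(eta) for name, inverse in inverse_links.items()}
for name, mu in means.items():
    print(f"{name:>8}: {np.round(mu, 4)}")
```

The same linear predictor therefore yields an unrestricted mean, a strictly positive mean, or a probability, depending only on the link.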

### Types of GLMs and When to Use Them
Different types of GLMs are used depending on the nature of the outcome variable and its distribution:

| Type of GLM | Response Variable (Y) | Common Use Cases  | Link Function  | Example  |
|---------------|--------|----------------------------------| -----------------------| -----------------------|  
|Linear Regression| Continuous (Normal) |Predicting continuous outcomes| identity: $g(\mu) = \mu$ |  Predicting house prices |
|Logistic Regression| Binary (0/1) |Binary classification problems| Logit: $g(\mu) = log(\frac{\mu}{1- \mu})$ |  Predicting customer churn (Yes/No) |
|Poisson Regression| Count data (non-negative integers) |Predicting counts (e.g., number of events)| Log: $g(\mu) = log(\mu)$ |Modeling number of claims in insurance |
|Negative Binomial Regression| Overdispersed count data | When counts vary more than a Poisson model allows (variance > mean)| Log: $g(\mu) = log(\mu)$ | Number of hospital visits per year |
|Gamma Regression| Positive continuous data |Skewed response variables (e.g., waiting times)| Inverse: $g(\mu) = \frac{1}{\mu}$ | Predicting time to process a transaction |

### How to Choose the Right GLM?

Identify the Nature of the Response Variable
- Continuous (Normal distribution) → Use Linear Regression
- Binary (0/1, Yes/No) → Use Logistic Regression
- Counts (0,1,2,3,…, Non-negative integers) → Use Poisson Regression
- Overdispersed counts (variance > mean) → Use Negative Binomial Regression
- Positive, right-skewed continuous data (e.g., time, waiting durations) → Use Gamma Regression

Check for Overdispersion (Variance > Mean in Count Data)
- If Poisson regression underestimates variance, switch to Negative Binomial Regression.

Check for Distribution Shape
- If the response is strictly positive and right-skewed, consider Gamma Regression.

Interpretability of the Model
- Ensure the chosen link function aligns with the problem and makes sense for predictions.
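
The checklist above can be sketched as a small helper. This is only a rough heuristic: the function name `suggest_glm_family` and the two thresholds are assumptions made for illustration, and a real choice should also consider the predictors and the subject matter.

```python
import numpy as np

def suggest_glm_family(y, overdispersion_ratio=1.5, skew_threshold=1.0):
    """Hypothetical helper: suggest a GLM family from the outcome values alone."""
    y = np.asarray(y, dtype=float)
    # Binary outcome (only 0/1 values) -> logistic regression
    if np.isin(y, [0.0, 1.0]).all():
        return "Binomial"
    # Non-negative integers -> count model; pick by variance-to-mean ratio
    is_count = bool((y >= 0).all() and np.allclose(y, np.round(y)))
    if is_count:
        ratio = y.var(ddof=1) / y.mean()
        return "NegativeBinomial" if ratio > overdispersion_ratio else "Poisson"
    # Strictly positive and right-skewed -> Gamma
    if (y > 0).all():
        skew = np.mean((y - y.mean()) ** 3) / y.std() ** 3
        if skew > skew_threshold:
            return "Gamma"
    # Otherwise fall back to ordinary linear regression
    return "Gaussian"

print(suggest_glm_family([0, 1, 1, 0, 1]))            # binary outcome
print(suggest_glm_family([0, 1, 2, 3, 2, 1]))         # counts, variance below mean
print(suggest_glm_family([0, 0, 0, 0, 20, 1]))        # counts, variance far above mean
print(suggest_glm_family([0.1, 0.2, 0.2, 0.3, 5.0]))  # positive and right-skewed
print(suggest_glm_family([-1.2, 0.3, 2.5]))           # continuous with negative values
```

The marginal variance-to-mean ratio used here ignores predictors, so it can overstate overdispersion; the residual-based checks later in this section are more reliable.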

##### Generalized Linear Models (GLMs) in Python Using the statsmodels Library
The examples below cover:
- Linear Regression (Normal Distribution)
- Logistic Regression (Binary Classification)
- Poisson Regression (Count Data)
- Negative Binomial Regression (Overdispersed Count Data)
- Gamma Regression (Skewed Continuous Data)

In [None]:
# Step 1: Install Dependencies
# Make sure you have statsmodels, pandas, numpy, and matplotlib installed
# (the leading "!" runs the command in the shell from a notebook cell):
!pip install statsmodels pandas numpy matplotlib

# Step 2: Load Libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

**Linear Regression (Continuous Outcome)**

📌 Example: Predict house prices based on square footage.
- Distribution: Normal
- Link Function: Identity $g(\mu) = \mu$

Interpretation: The model estimates house price based on square footage using a normal distribution.

In [None]:
# Simulate Data
np.random.seed(42)
n = 100
X = np.random.uniform(1000, 4000, n)  # Square footage
y = 50000 + 150 * X + np.random.normal(0, 50000, n)  # House Price with noise

# Convert to DataFrame
df = pd.DataFrame({'SquareFootage': X, 'HousePrice': y})

# Fit Linear Regression using GLM
model = smf.glm('HousePrice ~ SquareFootage', data=df, family=sm.families.Gaussian()).fit()

# Print Summary
print(model.summary())

**Logistic Regression (Binary Outcome)**

📌 Example: Predict whether a customer will churn (1 = Yes, 0 = No).
- Distribution: Binomial
- Link Function: Logit $g(\mu) = log(\frac{\mu}{1- \mu})$

 Interpretation: The coefficients indicate how monthly spend affects the probability of customer churn.

In [None]:
# Simulate Data
np.random.seed(42)
n = 200
X = np.random.uniform(500, 3000, n)  # Monthly Spend
p = 1 / (1 + np.exp(-(-3 + 0.002 * X)))  # Logistic function
y = np.random.binomial(1, p, n)  # Churn: 0 or 1

# Convert to DataFrame
df = pd.DataFrame({'MonthlySpend': X, 'Churn': y})

# Fit Logistic Regression using GLM
model = smf.glm('Churn ~ MonthlySpend', data=df, family=sm.families.Binomial()).fit()

# Print Summary
print(model.summary())

**Poisson Regression (Count Outcome)**

📌 Example: Predict the number of claims per customer in an insurance dataset.
- Distribution: Poisson
- Link Function: Log $g(\mu) = log(\mu)$

Interpretation: The model estimates expected claims based on age.

In [None]:
# Simulate Data
np.random.seed(42)
n = 300
X = np.random.uniform(18, 65, n)  # Customer Age
y = np.random.poisson(lam=np.exp(0.1 * X - 4), size=n)  # Number of Claims

# Convert to DataFrame
df = pd.DataFrame({'Age': X, 'NumClaims': y})

# Fit Poisson Regression using GLM
model = smf.glm('NumClaims ~ Age', data=df, family=sm.families.Poisson()).fit()

# Print Summary
print(model.summary())


In [None]:
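import statsmodels.api as sm
# Array-based API on the Poisson data above; sm.GLM needs an explicit intercept column
X_design = sm.add_constant(df[['Age']])
X_train, X_test, y_train = X_design.iloc[:240], X_design.iloc[240:], df['NumClaims'].iloc[:240]
poisson_model = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
predictions = poisson_model.predict(X_test)  # expected claim counts for the held-out rows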

**Negative Binomial Regression (Overdispersed Count Data)**

📌 Example: Predict the number of visits to a hospital per year.
- Distribution: Negative Binomial (for overdispersed count data)
- Link Function: Log $g(\mu) = log(\mu)$

Interpretation: Negative Binomial handles cases where variance exceeds the mean.

In [None]:
# Simulate Data
np.random.seed(42)
n = 300
X = np.random.uniform(20, 80, n)  # Patient Age
y = np.random.negative_binomial(n=2, p=0.5, size=n)  # Overdispersed Count Data

# Convert to DataFrame
df = pd.DataFrame({'Age': X, 'HospitalVisits': y})

# Fit Negative Binomial Regression using GLM
model = smf.glm('HospitalVisits ~ Age', data=df, family=sm.families.NegativeBinomial()).fit()

# Print Summary
print(model.summary())


**Gamma Regression (Skewed Continuous Data)**

📌 Example: Predict the time taken to process a transaction.
- Distribution: Gamma (for right-skewed continuous data)
- Link Function: Inverse $g(\mu) = \frac{1}{\mu}$

Interpretation: This model is useful for modeling positively skewed transaction processing times.

In [None]:
# Simulate Data
np.random.seed(42)
n = 200
X = np.random.uniform(100, 1000, n)  # Transaction Complexity
y = np.random.gamma(shape=2, scale=X / 200, size=n)  # Processing Time

# Convert to DataFrame
df = pd.DataFrame({'TransactionComplexity': X, 'ProcessingTime': y})

# Fit Gamma Regression using GLM
# The Gamma family's default link is the inverse link, so no explicit link argument is needed
model = smf.glm('ProcessingTime ~ TransactionComplexity', data=df, family=sm.families.Gamma()).fit()

# Print Summary
print(model.summary())


#### Why Penalizing the Magnitude of Coefficients Works in the First Place

To understand the impact of model complexity on the magnitude of coefficients, we simulate a sine curve (between 60° and 300°) and add some random noise.

The result resembles a sine curve, but not exactly, because of the noise.

We then estimate the sine function using polynomial regression with powers of x from 1 to 15, adding a column for each power up to 15 to the dataframe.

In [None]:
#Importing libraries. The same will be used throughout the article.
import numpy as np
import pandas as pd
import random

import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 10

#Define input array with angles from 60deg to 300deg converted to radians
x = np.array([i*np.pi/180 for i in range(60,300,4)])
np.random.seed(10)  #Setting seed for reproducibility
y = np.sin(x) + np.random.normal(0,0.15,len(x))
data = pd.DataFrame(np.column_stack([x,y]),columns=['x','y'])
plt.plot(data['x'],data['y'],'.')
plt.show()

In [None]:
for i in range(2,16):  #power of 1 is already there
    colname = 'x_%d'%i      #new var will be x_power
    data[colname] = data['x']**i
print(data.head()) # add a column for each power upto 15 

#### Making 15 Different Linear Regression Models

Now that we have all 15 powers, let’s make 15 different linear regression models, with each model containing variables with powers of x from 1 to the particular model number.

Define a generic function that takes in the required maximum power of x as an input and returns a list containing 
- model RSS, 
- intercept, 
- coef_x, 
- coef_x2, … up to the entered power.

Here RSS refers to the ‘Residual Sum of Squares’: the sum of squared errors between the predicted and actual values in the training data set. It serves as the cost function (loss function) for linear regression.

The function plots the model fit only for the powers listed in `models_to_plot`, but it returns the RSS and coefficient values for every model.
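
As a quick check on the definition, RSS can be computed by hand. The four observed and predicted values below are made-up numbers for illustration only.

```python
import numpy as np

# Toy observed and predicted values (illustrative, not from the sine data)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.7])

residuals = y_true - y_pred           # errors: -0.1, 0.1, -0.2, 0.3
rss = float(np.sum(residuals ** 2))   # 0.01 + 0.01 + 0.04 + 0.09

print(f"RSS = {rss:.2f}")
```

A model that fits the training points more closely always has a lower RSS, which is why RSS alone cannot detect overfitting.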

In [None]:
# Import Linear Regression model from scikit-learn.
from sklearn.linear_model import LinearRegression

def linear_regression(data, power, models_to_plot):
    #initialize predictors:
    predictors=['x']
    if power>=2:
        predictors.extend(['x_%d'%i for i in range(2,power+1)])
    
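    #Fit the model
    #(the `normalize` argument was removed in scikit-learn 1.2; for plain least squares it did not change the fitted values)
    linreg = LinearRegression()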
    linreg.fit(data[predictors],data['y'])
    y_pred = linreg.predict(data[predictors])
    
    #Check if a plot is to be made for the entered power
    if power in models_to_plot:
        plt.subplot(models_to_plot[power])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for power: %d'%power)
    
    #Return the result in pre-defined format
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    ret.extend([linreg.intercept_])
    ret.extend(linreg.coef_)
    return ret

##### Store all the Results in Pandas Dataframe

Store all the results in a Pandas dataframe and plot 6 models to get an idea of the trend.

Expectation: models with increasing complexity fit the training data better and produce lower RSS values.
- As the model complexity increases, the models tend to fit even smaller deviations in the training data set. 
- Though this leads to overfitting, let’s keep this issue aside for some time and come to our main objective, i.e., the impact on the magnitude of coefficients.

In [None]:
# Initialize a dataframe to store the results:
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['model_pow_%d'%i for i in range(1,16)]
coef_matrix_simple = pd.DataFrame(index=ind, columns=col)

# Define the powers for which a plot is required:
models_to_plot = {1:231,3:232,6:233,9:234,12:235,15:236}

# Iterate through all powers and assimilate results
for i in range(1,16):
    coef_matrix_simple.iloc[i-1,0:i+2] = linear_regression(data, power=i, models_to_plot=models_to_plot)

In [None]:
#Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_simple

It's evident that the size of the coefficients increases exponentially with model complexity.
- This gives some intuition into why putting a constraint on the magnitude of coefficients can be a good way to reduce model complexity.

##### Significance of Large Coefficients

A large coefficient means that we’re putting a lot of emphasis on that feature, i.e., the particular feature is treated as a strong predictor for the outcome. 
- When a coefficient becomes too large, the algorithm starts modeling intricate relations to estimate the output and ends up overfitting the particular training data.

Solution
- Ridge and lasso regression, which constrain the magnitude of the coefficients.
- We will see how well they work for the same problem.

In [None]:
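# Assumes X_train, y_train, and X_test already exist (for example from train_test_split) and that y_train holds class labels
from sklearn.tree import DecisionTreeClassifier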
from sklearn.ensemble import RandomForestClassifier

tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
predictions_tree = tree_model.predict(X_test)
predictions_rf = rf_model.predict(X_test)


### **4.3. Poisson Regression**

Objective:
- We want to predict the number of insurance claims per customer using Poisson regression.

##### Understanding Poisson Regression
- Poisson regression is used for modeling count data, where the outcome variable represents the number of times an event occurs within a fixed period.
- Example: Number of insurance claims per customer in a year.

The Poisson distribution assumes that:
$$ P(Y = k) = \frac{e^{-\lambda} \lambda^{k}}{k!}$$
- where:
    - 𝑌 is the count outcome (e.g., number of claims)
    - 𝜆 (mean rate) represents the expected number of occurrences
    - k is the observed count

The Poisson regression model uses a log link function:
$$ log(\lambda) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n$$
- so that:
$$\lambda = e^{(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ...)}$$
- This ensures that the predicted mean count λ is always positive.
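
The two formulas above can be worked through with plain arithmetic. The coefficients `beta_0` and `beta_1` and the age of 50 are hypothetical values chosen only to illustrate the calculation.

```python
import math

# Hypothetical coefficients, chosen only to illustrate the arithmetic
beta_0, beta_1 = -3.0, 0.02
age = 50

# Log link: log(lambda) is linear in the predictor, so lambda = exp(linear predictor)
lam = math.exp(beta_0 + beta_1 * age)

def poisson_pmf(k, lam):
    """P(Y = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

print(f"Expected claims (lambda): {lam:.4f}")
for k in range(4):
    print(f"P(Y = {k}) = {poisson_pmf(k, lam):.4f}")
```

Because the linear predictor is exponentiated, even a strongly negative value of $\beta_0 + \beta_1 X_1$ gives a small but positive expected count.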

##### Assumptions of Poisson Regression
- Count Data: The dependent variable (Y) must be a count (0,1,2,...).
- Mean = Variance Assumption (Equidispersion): The mean and variance should be roughly equal. If variance is much greater than the mean (overdispersion), Negative Binomial Regression might be better.
- Independence of Observations: Claims from one customer should not influence claims from another.
- Exponential Relationship: The logarithm of the mean count is a linear function of predictors.
- No Excess Zeros (Zero-Inflation): Too many zero claims may indicate a zero-inflated Poisson (ZIP) model is needed.

##### Data Considerations
Before modeling, we must explore and preprocess the data.
- Important Features for Predicting Claims
    - Customer Age (older customers may file more claims)
    - Previous Claims History (customers with past claims may be riskier)
    - Vehicle Type (luxury cars may have more claims)
    - Driving Record (poor records suggest higher claims)
    - Policy Type (comprehensive vs. basic coverage)
    - Region (urban vs. rural)

In [None]:
# Step 1: Import Necessary Libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns

# Step 2: Load and Explore Data
# Simulate an insurance dataset for Poisson regression.

# Simulating Data
np.random.seed(42)
n = 1000  # Number of customers

# Simulating features
age = np.random.randint(18, 75, size=n)  # Age of customers
past_claims = np.random.poisson(1.2, size=n)  # Past claims
vehicle_type = np.random.choice([0, 1], size=n, p=[0.7, 0.3])  # 0 = Regular, 1 = Luxury
region = np.random.choice([0, 1], size=n, p=[0.6, 0.4])  # 0 = Rural, 1 = Urban

# Generate number of claims (Outcome variable) using Poisson function
claims = np.random.poisson(lam=np.exp(0.02 * age + 0.5 * past_claims + 0.8 * vehicle_type + 0.3 * region - 3))

# Create DataFrame
df = pd.DataFrame({'Age': age, 'PastClaims': past_claims, 'VehicleType': vehicle_type, 'Region': region, 'NumClaims': claims})

# Show Data
print(df.head())

# Distribution Plot
sns.histplot(df['NumClaims'], bins=10, kde=False)
plt.xlabel("Number of Claims")
plt.ylabel("Frequency")
plt.title("Distribution of Claims")
plt.show()

# Step 3: Fit the Poisson Regression Model
# Fit Poisson Regression Model
model = smf.glm('NumClaims ~ Age + PastClaims + VehicleType + Region', data=df, family=sm.families.Poisson()).fit()

# Model Summary
print(model.summary())

# Exponentiated Coefficients (Rate Ratios)
np.exp(model.params)

##### Interpreting the Output
Intercept ($𝛽_0$): 
- Baseline log-count when all predictors are 0.

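Coefficients:
- Change in the log of expected claims per unit increase in the predictor, holding the other predictors fixed.

Exponentiated Coefficients (Rate Ratios):
- Values > 1: Expected claims are multiplied by that factor per unit increase (an increase in claims).
- Values < 1: Expected claims shrink by that factor per unit increase (a decrease in claims).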
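
A short sketch of how log-scale coefficients translate into rate ratios and percentage changes. The coefficient values are hypothetical, and `Discount` is an invented predictor used only to show a negative effect.

```python
import numpy as np

# Hypothetical log-scale coefficients, as they would appear in a Poisson summary table
coefs = {"PastClaims": 0.5, "VehicleType": 0.8, "Discount": -0.2}

# Exponentiating a coefficient gives the multiplicative change in the expected count
rate_ratios = {name: float(np.exp(beta)) for name, beta in coefs.items()}

for name, rr in rate_ratios.items():
    pct_change = (rr - 1.0) * 100.0
    print(f"{name}: rate ratio = {rr:.3f} ({pct_change:+.1f}% expected claims per unit increase)")
```

A coefficient of 0.5, for example, corresponds to a rate ratio of about 1.65, i.e. roughly 65% more expected claims per additional unit of the predictor.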

##### Key Checks
- If NumClaims has too many zeros, consider a **zero-inflated model** (below).
- If variance > mean, **check for overdispersion** (below).

In [None]:
# Step 4: Model Diagnostics
print("Mean of Claims:", df['NumClaims'].mean())
print("Variance of Claims:", df['NumClaims'].var())

# Goodness-of-Fit Tests
print("Deviance:", model.deviance)
print("Pearson Chi-Square:", model.pearson_chi2)

# Step 5: Make Predictions
# Predict the expected number of claims for new customers.
# Creating new data for prediction
new_data = pd.DataFrame({'Age': [30, 50], 'PastClaims': [1, 3], 'VehicleType': [0, 1], 'Region': [1, 0]})

# Predict Expected Claims
new_data['PredictedClaims'] = model.predict(new_data)
print(new_data)

# Interpretation: These are the expected mean number of claims for new customers.

##### What to Look Out For in the Data
- Zero-inflation: Too many customers with zero claims? Consider Zero-Inflated Poisson (ZIP).
- Overdispersion: If variance > mean, use Negative Binomial Regression.
- Influential Data Points: Check for outliers using Cook’s distance.
- Multicollinearity: High correlation among features? Use Variance Inflation Factor (VIF).

#### Key takeaways for Poisson Regression model:
- Best for: 
    - Count data (e.g., number of claims)
- Key assumptions: 
    - Mean ≈ variance; independent events
- If overdispersed: 
    - Use Negative Binomial
- If too many zeros: 
    - Consider Zero-Inflated Poisson
- Predictions:
    - Expected claim counts from a log-linear model

### **Zero-Inflated Poisson (ZIP) Modeling**

Why ZIP?
- The Zero-Inflated Poisson (ZIP) model is used when count data has excessive zeros—more than what a standard Poisson regression can handle. 
- This often happens when:
    - Some customers never file claims (structural zeros).
    - Others have random chances of filing claims.

The ZIP model combines:
- A Poisson Model (for regular counts).
- A Logistic Model (to predict if a zero comes from a "never claims" group).

##### ZIP Model Structure
The ZIP model assumes:
$$ P(Y = 0) = \pi + (1 - \pi) e^{-\lambda}$$
$$ P(Y = k) = (1 - \pi)\frac{e^{-\lambda} \lambda^{k}}{k!} , k > 0$$
- where:
    - π = Probability of always being in the zero-claim group.
    - λ = Expected number of claims for those who can file claims.

The model has:
- A logistic model for zero-inflation probability π.
- A Poisson model for expected counts λ.
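
The ZIP probabilities above can be checked numerically. The values of `pi` and `lam` below are illustrative assumptions, not estimates from the insurance data.

```python
import math

def zip_pmf(k, pi, lam):
    """P(Y = k) under a zero-inflated Poisson with inflation probability pi and Poisson mean lam."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        return pi + (1 - pi) * poisson   # structural zeros plus Poisson zeros
    return (1 - pi) * poisson            # positive counts come only from the Poisson part

pi, lam = 0.3, 1.5  # illustrative values

p_zero_poisson = math.exp(-lam)                        # zero probability of a plain Poisson
p_zero_zip = zip_pmf(0, pi, lam)                       # zero probability under ZIP
total = sum(zip_pmf(k, pi, lam) for k in range(60))    # probabilities should sum to 1
zip_mean = (1 - pi) * lam                              # overall expected count under ZIP

print(f"P(Y = 0), Poisson: {p_zero_poisson:.4f}")
print(f"P(Y = 0), ZIP:     {p_zero_zip:.4f}")
print(f"Sum of ZIP probabilities: {total:.6f}")
print(f"ZIP mean: {zip_mean:.3f}")
```

The extra mass at zero comes entirely from π, which is why the expected count shrinks from λ to (1 − π)λ.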

##### When to Use ZIP?
1. Check for Excess Zeros:
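    - If the share of zeros is much larger than a Poisson model with the same mean would predict (a share above roughly 40% is a common warning sign), ZIP is worth trying.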
2. Compare Mean & Variance
    - If variance >> mean, ZIP or Negative Binomial may be needed.
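
One way to make the "excess zeros" check less arbitrary is to compare the observed share of zeros with the share a Poisson distribution with the same mean would produce, $e^{-\bar{y}}$. The sketch below uses freshly simulated counts (30% structural zeros mixed with Poisson(1.5) counts); these settings are assumptions for illustration, and the comparison ignores predictors.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Simulated counts: 30% structural zeros mixed with Poisson(1.5) counts (illustrative values)
structural_zero = rng.random(n) < 0.3
counts = np.where(structural_zero, 0, rng.poisson(1.5, size=n))

observed_zero_share = float(np.mean(counts == 0))
# Under an intercept-only Poisson model, P(Y = 0) = exp(-mean)
poisson_zero_share = float(np.exp(-counts.mean()))

print(f"Observed share of zeros:        {observed_zero_share:.3f}")
print(f"Poisson-implied share of zeros: {poisson_zero_share:.3f}")
```

A clear gap between the two shares suggests zero inflation; if they are close, a plain Poisson or Negative Binomial model may already be adequate.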

In [None]:
# Check for Excess Zeros
zero_count = (df['NumClaims'] == 0).sum()
zero_percentage = zero_count / len(df)
print(f"Zero Percentage: {zero_percentage:.2%}")  # format the proportion as a percentage

# Compare Mean & Variance
print("Mean of Claims:", df['NumClaims'].mean())
print("Variance of Claims:", df['NumClaims'].var())

##### Fitting a ZIP Model in Python
We use statsmodels’ ZeroInflatedPoisson.

In [None]:
# Step 1: Install Dependencies
!pip install statsmodels

# Step 2: Import Libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns

# Step 3: Simulate an Insurance Dataset
np.random.seed(42)
n = 1000  

# Simulating features
age = np.random.randint(18, 75, size=n)  
past_claims = np.random.poisson(1.2, size=n)  
vehicle_type = np.random.choice([0, 1], size=n, p=[0.7, 0.3])  
region = np.random.choice([0, 1], size=n, p=[0.6, 0.4])  

# Generate structural zeros (some never claim)
never_claims = np.random.binomial(1, 0.3, size=n)  
claims = np.random.poisson(lam=np.exp(0.02 * age + 0.5 * past_claims + 0.8 * vehicle_type + 0.3 * region - 3))

# Apply zero-inflation
claims[never_claims == 1] = 0  

# Create DataFrame
df = pd.DataFrame({'Age': age, 'PastClaims': past_claims, 'VehicleType': vehicle_type, 'Region': region, 'NumClaims': claims})

# Show Data
print(df.head())

# Plot distribution
sns.histplot(df['NumClaims'], bins=10, kde=False)
plt.xlabel("Number of Claims")
plt.ylabel("Frequency")
plt.title("Distribution of Claims")
plt.show()

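# Step 4: Fit ZIP Model
import statsmodels.discrete.count_model as cm

# Build the design matrices explicitly: exog_infl must be an array of predictors, not a formula string
count_cols = ['Age', 'PastClaims', 'VehicleType', 'Region']  # predictors of the Poisson mean
infl_cols = ['Age', 'VehicleType']                            # predictors of the zero-inflation probability
exog = sm.add_constant(df[count_cols])
exog_infl = sm.add_constant(df[infl_cols])

# Fit ZIP Model
zip_model = cm.ZeroInflatedPoisson(df['NumClaims'], exog, exog_infl=exog_infl, inflation='logit').fit()

# Model Summary
print(zip_model.summary())

Interpreting Results

Poisson Component (Regular Counts)
- Age, PastClaims, VehicleType, Region influence claim counts.
- Exponentiate these coefficients (np.exp(zip_model.params)) for rate ratios.

Zero-Inflation Component (Logistic Model)
- Age, VehicleType influence whether someone never claims.
- These coefficients (prefixed with `inflate_` in the summary) are on the log-odds scale; larger positive values mean a higher probability of a structural zero.

In [None]:
# Step 5: Compare Poisson vs. ZIP
# Refit the plain Poisson model on this zero-inflated data so both models use the same observations
poisson_fit = smf.poisson('NumClaims ~ Age + PastClaims + VehicleType + Region', data=df).fit()

## Method 1: Log-Likelihood and AIC Comparison
print("Poisson Log-Likelihood:", poisson_fit.llf, "AIC:", poisson_fit.aic)
print("ZIP Log-Likelihood:", zip_model.llf, "AIC:", zip_model.aic)
# ZIP has extra parameters, so also compare AIC: the lower value indicates the better trade-off.

## Method 2: Vuong Test for ZIP vs. Poisson
# Computed by hand from the per-observation log-likelihoods of the two fits
from scipy import stats
ll_zip = zip_model.model.loglikeobs(np.asarray(zip_model.params))
ll_poisson = poisson_fit.model.loglikeobs(np.asarray(poisson_fit.params))
ll_diff = ll_zip - ll_poisson
test_stat = np.sqrt(len(ll_diff)) * ll_diff.mean() / ll_diff.std(ddof=1)
p_value = 2 * stats.norm.sf(abs(test_stat))
print(f"Vuong Test Statistic: {test_stat}, p-value: {p_value}")
# If p < 0.05 and the statistic is positive, ZIP is preferred.

# Step 6: Make Predictions
# Creating new customer data
new_data = pd.DataFrame({'Age': [30, 50], 'PastClaims': [1, 3], 'VehicleType': [0, 1], 'Region': [1, 0]})

# Predict expected claims (the new design matrices need the same columns, including the constant)
new_exog = sm.add_constant(new_data[count_cols], has_constant='add')
new_exog_infl = sm.add_constant(new_data[infl_cols], has_constant='add')
new_data['PredictedClaims'] = np.asarray(zip_model.predict(exog=new_exog, exog_infl=new_exog_infl))
print(new_data)
# Interpretation: The expected number of claims accounts for excess zeros.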


##### What to Look Out For
Issue: Too many zeros?
- Solution: ZIP Model

Issue: Overdispersion?
- Solution: Zero-Inflated Negative Binomial (ZINB)

Issue: Unusual logistic coefficients?
- Solution: Check multicollinearity

##### When Not to Use ZIP?
- If zero inflation is low, use regular Poisson.
- If variance >> mean even after accounting for the excess zeros, use Zero-Inflated Negative Binomial (ZINB).

##### Key takeaways for ZIP Model
- Best for?
    - Count data with excess zeros
- Two Parts?
    - Logistic (excess zeros) + Poisson (counts)
- Key Assumptions?
    - Some customers never file claims
- Alternatives?
    - Zero-Inflated Negative Binomial (ZINB) for overdispersion
- Evaluation?
    - Vuong Test, Log-Likelihood

### **Checking for Overdispersion**
How to Check for Overdispersion in Count Data
- Overdispersion occurs when the variance of the count data is greater than its mean. In count data models like Poisson regression, the assumption is that the mean equals the variance. When this assumption is violated, overdispersion is present.

##### Checking for Overdispersion
**1. Check Mean vs. Variance for Overdispersion**
- Poisson Model Assumption: Variance = Mean
- Overdispersion: Variance > Mean

You can easily check this by calculating the mean and variance of your outcome variable (e.g., number of claims) and comparing them.
- If variance > mean, overdispersion is likely, and a Negative Binomial Model may be needed. This raw comparison ignores the predictors, so treat it as a first screen and confirm with the residual-based checks.

In [None]:
mean_claims = df['NumClaims'].mean()
variance_claims = df['NumClaims'].var()

print(f"Mean: {mean_claims}")
print(f"Variance: {variance_claims}")

**2. Overdispersion Test: Pearson Residuals**

Another way to check for overdispersion is by using Pearson residuals. The formula for Pearson residuals is:
$$ r_i = \frac{y_i - \hat{\mu_i}}{\sqrt{\hat{\mu_i}}}$$
- where 
    - $𝑦_𝑖$: is the observed value, and 
    - $\hat{𝜇_𝑖}$: is the predicted mean.

For a correctly specified Poisson model, Pearson residuals have a variance of approximately 1. 
- Summing the squared residuals and dividing by the residual degrees of freedom gives the overdispersion statistic.
- If the statistic is well above 1, overdispersion is likely.

In [None]:
# Fit a Poisson model
poisson_model = smf.poisson('NumClaims ~ Age + PastClaims + VehicleType + Region', data=df).fit()

# Calculate Pearson residuals
pearson_residuals = poisson_model.resid_pearson

# Calculate the overdispersion statistic (Pearson chi-square divided by the residual degrees of freedom)
overdispersion_stat = (pearson_residuals**2).sum() / poisson_model.df_resid

print(f"Overdispersion Statistic: {overdispersion_stat}")


**3. Dispersion Test Using the Scale Parameter**
- The dispersion parameter for the Poisson model is assumed to be 1. 
    - If the dispersion estimated from the fit (Pearson chi-square / residual degrees of freedom) is well above 1, it indicates overdispersion.
    
Goodness-of-Fit Tests:
- If Deviance / df is well above 1, check for overdispersion.

In [None]:
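# A Poisson fit reports scale = 1 by construction, so estimate the dispersion from the fit statistics
glm_poisson = smf.glm('NumClaims ~ Age + PastClaims + VehicleType + Region', data=df, family=sm.families.Poisson()).fit()
dispersion_parameter = glm_poisson.pearson_chi2 / glm_poisson.df_resid
print(f"Dispersion Parameter: {dispersion_parameter}")
print(f"Deviance / df: {glm_poisson.deviance / glm_poisson.df_resid}")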


### **Handling Overdispersion**
- If overdispersion is detected, use Negative Binomial Regression.
    - Negative Binomial allows for overdispersion and typically improves model fit.

If overdispersion is detected, Poisson regression may not be suitable, and you should consider other models, such as:
- Negative Binomial Regression: It accounts for overdispersion by introducing a second parameter to model the variance.
- Zero-Inflated Models: If overdispersion is combined with excessive zeros, ZIP (Zero-Inflated Poisson) or ZINB (Zero-Inflated Negative Binomial) models are better suited.

In [None]:
# Fit Negative Binomial Model (the GLM family keeps alpha fixed; the default is alpha=1.0)
neg_bin_model = smf.glm('NumClaims ~ Age + PastClaims + VehicleType + Region', data=df, family=sm.families.NegativeBinomial()).fit()
# Model Summary
print(neg_bin_model.summary())

## Example 2

# Fit Negative Binomial model with the discrete-model API, which also estimates the dispersion parameter alpha
negbinom_model = smf.negativebinomial('NumClaims ~ Age + PastClaims + VehicleType + Region', data=df).fit(maxiter=200)
print(negbinom_model.summary())

### **4.4. Negative Binomial Model**
Objective: Predict the number of visits to a hospital per year.

The Negative Binomial regression model is used when count data exhibits overdispersion, meaning the variance is greater than the mean. 
- This typically occurs in cases where the data is more spread out than what a Poisson model would predict.

##### Negative Binomial Regression Overview
- The Negative Binomial (NB) regression model is an extension of the Poisson regression model that adds a **dispersion parameter** to handle overdispersion. 
- This allows the model to account for situations where the variance exceeds the mean, which is a common occurrence in real-world count data.
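
The effect of the dispersion parameter is easiest to see in the variance function. In the common NB2 parameterization, $Var(Y) = \mu + \alpha \mu^2$, compared with $Var(Y) = \mu$ for Poisson. The means and the value of `alpha` below are illustrative assumptions.

```python
import numpy as np

mu = np.array([0.5, 2.0, 10.0])  # illustrative mean counts
alpha = 0.5                       # hypothetical dispersion parameter

poisson_var = mu                  # Poisson: variance equals the mean
nb2_var = mu + alpha * mu ** 2    # NB2: variance grows quadratically with the mean

for m, pv, nv in zip(mu, poisson_var, nb2_var):
    print(f"mean = {m:5.1f} | Poisson variance = {pv:6.3f} | NB2 variance = {nv:6.3f}")
```

As alpha approaches 0 the two variance functions coincide, which is why the Poisson model can be viewed as a limiting case of the Negative Binomial.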

- Best for?
    - Overdispersed count data (Variance > Mean)
- Key Assumptions?
    - Count outcome with variance greater than the mean; independent observations
- What to check?
    - Mean vs. Variance, Pearson Residuals, Dispersion Parameter
- How to compare?
    - Compare Log-Likelihood of Poisson vs. Negative Binomial
- Alternatives?
    - If overdispersion is not present, Poisson regression

##### Steps for Analyzing Negative Binomial Regression for Predicting Hospital Visits
The data contain variables that may influence the number of hospital visits, such as:
- age, 
- previous visits, and 
- chronic conditions.

In [None]:
# Step 1:  Simulate the Dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Simulating data for hospital visits
np.random.seed(42)
n = 1000  # Number of observations

# Simulate features
age = np.random.randint(18, 80, size=n)
previous_visits = np.random.poisson(2, size=n)  # Previous hospital visits
chronic_conditions = np.random.choice([0, 1], size=n, p=[0.7, 0.3])  # Binary (0 = no, 1 = yes)

# Generate the number of hospital visits per year
# Use a log link for the expected value and include the effects of age, previous visits, and chronic conditions
log_mu = 0.02 * age + 0.5 * previous_visits + 0.8 * chronic_conditions
lambda_visits = np.exp(log_mu)

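# Overdispersion: draw counts from a Negative Binomial whose mean is lambda_visits
# (with n=2 and p = 2 / (2 + lambda), the mean is lambda and the variance is lambda + lambda**2 / 2)
visits = np.random.negative_binomial(n=2, p=2 / (2 + lambda_visits))  # Overdispersed count data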

# Create DataFrame
df = pd.DataFrame({
    'Age': age,
    'PreviousVisits': previous_visits,
    'ChronicConditions': chronic_conditions,
    'HospitalVisits': visits
})

# Show the first few rows of the dataset
print(df.head())

# Plotting the distribution of hospital visits
sns.histplot(df['HospitalVisits'], kde=False, bins=10)
plt.title('Distribution of Hospital Visits per Year')
plt.xlabel('Number of Visits')
plt.ylabel('Frequency')
plt.show()

# Step 2: Check Mean vs. Variance
# Look at the mean vs. variance to check if overdispersion is present.
mean_visits = df['HospitalVisits'].mean()
variance_visits = df['HospitalVisits'].var()

print(f"Mean of Hospital Visits: {mean_visits}")
print(f"Variance of Hospital Visits: {variance_visits}")
# If Variance > Mean, this suggests that overdispersion is present, and the Negative Binomial model may be appropriate.

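# Step 3: Fit a Negative Binomial Regression Model
formula = 'HospitalVisits ~ Age + PreviousVisits + ChronicConditions'
# Estimate the dispersion parameter alpha by maximum likelihood with the discrete NB model
nb_mle = smf.negativebinomial(formula, data=df).fit(maxiter=200, disp=False)
alpha_hat = nb_mle.params['alpha']
print(f"Estimated alpha: {alpha_hat}")
# Fit Negative Binomial Model as a GLM with alpha fixed at that estimate;
# the GLM results expose the residual and influence diagnostics used later
nb_model = smf.glm(formula, data=df, family=sm.families.NegativeBinomial(alpha=alpha_hat)).fit()
print(nb_model.summary())

# Step 4: Checking Overdispersion:
# Verify whether the fitted model captures the dispersion by looking at Pearson residuals and the deviance.
# (a) Pearson Residuals Test:
# Calculate Pearson residuals
pearson_residuals = nb_model.resid_pearson

# Calculate the overdispersion statistic (Pearson chi-square / residual degrees of freedom)
overdispersion_stat = (pearson_residuals**2).sum() / nb_model.df_resid

print(f"Overdispersion Statistic: {overdispersion_stat}")

# (b)  Dispersion Parameter Test:
# The reported scale is fixed at 1 for this family, so use deviance / residual degrees of freedom instead
dispersion_parameter = nb_model.deviance / nb_model.df_resid
print(f"Dispersion Parameter: {dispersion_parameter}")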

# Step 5: Comparing with Poisson Model:
# Compare the Negative Binomial model to a Poisson model to see which one fits better
# Fit Poisson Model for comparison
poisson_model = smf.poisson('HospitalVisits ~ Age + PreviousVisits + ChronicConditions', data=df).fit()

# Compare Log-Likelihood values
print(f"Poisson Model Log-Likelihood: {poisson_model.llf}")
print(f"Negative Binomial Model Log-Likelihood: {nb_model.llf}")

# Step 6: Making Predictions
# Use the Negative Binomial model to predict the number of hospital visits for new patients.
# New data for prediction
new_data = pd.DataFrame({
    'Age': [30, 60],
    'PreviousVisits': [2, 5],
    'ChronicConditions': [0, 1]
})

# Predict hospital visits
new_data['PredictedVisits'] = nb_model.predict(new_data)
print(new_data)

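In step 3:
- The model must be fitted with a Negative Binomial likelihood, for example `smf.negativebinomial` or a GLM with `sm.families.NegativeBinomial`. The `method` argument of `fit` only selects the optimizer; it does not change the distribution, so a model built with `smf.poisson` remains a Poisson model.
- The Negative Binomial adds a dispersion parameter (alpha), which keeps the standard errors valid when the data has variance greater than the mean.

In Step 4:
- (a) If the overdispersion statistic is well above 1, the fitted model still understates the variance; a value close to 1 means the dispersion is adequately captured.
- (b) The same reading applies to the dispersion estimate: values well above 1 point to remaining overdispersion.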

In Step 5:
- A higher log-likelihood for the Negative Binomial model indicates it better fits the data, especially in the case of overdispersion.

In step 6:
- This will output the predicted number of hospital visits per year for new data points.
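
The log-likelihood comparison in Step 5 can be turned into a likelihood-ratio test. The sketch below is a minimal version under two assumptions: the Negative Binomial log-likelihood comes from a fit in which alpha was estimated by maximum likelihood, and the two log-likelihood values passed in are made-up numbers rather than results from the hospital data. The function name `lr_test_poisson_vs_nb` is hypothetical.

```python
from scipy import stats

def lr_test_poisson_vs_nb(llf_poisson, llf_nb):
    """Likelihood-ratio test of Poisson (alpha = 0) against Negative Binomial."""
    lr_stat = max(2.0 * (llf_nb - llf_poisson), 0.0)
    if lr_stat == 0.0:
        return lr_stat, 1.0
    # alpha = 0 lies on the boundary of the parameter space, so the usual
    # chi-square(1) p-value is halved
    p_value = 0.5 * stats.chi2.sf(lr_stat, df=1)
    return lr_stat, p_value

# Made-up log-likelihoods for illustration; in the notebook these would be
# poisson_model.llf and the log-likelihood of the Negative Binomial fit
lr_stat, p_value = lr_test_poisson_vs_nb(llf_poisson=-2450.0, llf_nb=-2300.0)
print(f"LR statistic: {lr_stat:.1f}, p-value: {p_value:.3g}")
```

A small p-value indicates that the extra dispersion parameter improves the fit by more than chance would explain.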

### Model Diagnostics for Negative Binomial Regression
When working with count data models like the Negative Binomial (NB) regression, it's crucial to assess the model's 
- performance, 
- assumptions, and 
- potential issues. 

Here are some essential diagnostics:

##### **Residual Analysis for Negative Binomial Model**
- Residuals represent the difference between the observed values and the predicted values. 
- In regression, we use residuals to check how well the model fits the data. 
- For count data, there are a few specific types of residuals to consider:
    - Pearson Residuals: Measures the standardized difference between the observed and predicted values.
    - Deviance Residuals: Similar to Pearson residuals, but based on the likelihood of the model.

**Pearson Residuals Plot**

Interpretation:
- In a well-fitted model, the residuals should be centered around zero with no systematic pattern.
- If there's a pattern in the residuals, it may indicate model misspecification or a need for more predictors.
- In the QQ plot, points lying roughly on a straight line indicate approximately normal residuals. For count data with small means some skew is expected, so treat this as a rough guide.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate Pearson residuals
pearson_residuals = nb_model.resid_pearson

# Plot residuals
sns.histplot(pearson_residuals, kde=True, bins=20)
plt.title("Histogram of Pearson Residuals")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()

# Check for normality in residuals (seaborn has no qqplot function; statsmodels provides one)
sm.qqplot(np.asarray(pearson_residuals), line='45')
plt.title("QQ Plot of Pearson Residuals")
plt.show()

**Deviance Residuals Plot**

Interpretation: 
- Deviance residuals provide insight into how much each observation deviates from the fitted model. 
    - They are particularly useful for identifying outliers or influential data points.

In [None]:
# Calculate Deviance residuals
deviance_residuals = nb_model.resid_deviance

# Plot deviance residuals
sns.histplot(deviance_residuals, kde=True, bins=20)
plt.title("Histogram of Deviance Residuals")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()

##### **Influence Diagnostics for Negative Binomial Model**
In regression models, influential points are data points that can significantly affect the estimated coefficients. 
- Influence measures help us identify these points. 
- For count data models, a useful diagnostic is the Cook’s Distance.

**Cook’s Distance Plot**

Cook’s distance is a metric that combines the leverage and residual of each observation to assess its influence on the fitted model.

Interpretation:
- Points with Cook’s Distance greater than the threshold (typically 4/n where n is the number of observations) are considered influential.
- These points should be examined carefully, as they could disproportionately affect the model fit.

In [None]:
# Compute influence measures; cooks_distance returns (distances, p-values)
influence = nb_model.get_influence()
cooks_d = influence.cooks_distance[0]

# Plot Cook's Distance
plt.scatter(range(len(cooks_d)), cooks_d)
plt.axhline(y=4/len(df), color='r', linestyle='--')
plt.title("Cook's Distance")
plt.xlabel("Observation Index")
plt.ylabel("Cook's Distance")
plt.show()
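
To go from the plot to a list of observations worth inspecting, a small helper like the one below can apply the 4/n rule of thumb. This is only a sketch: `flag_influential` and the example distances are hypothetical, and in practice the input would be the Cook's distances computed from the fitted model.

```python
import numpy as np

def flag_influential(cooks_d, threshold=None):
    """Return indices of observations whose Cook's distance exceeds the threshold (default 4/n)."""
    cooks_d = np.asarray(cooks_d, dtype=float)
    if threshold is None:
        threshold = 4 / len(cooks_d)
    return np.flatnonzero(cooks_d > threshold)

# Hypothetical Cook's distances for 8 observations (threshold = 4/8 = 0.5)
example_d = np.array([0.01, 0.02, 0.75, 0.03, 0.10, 0.60, 0.02, 0.01])
print(flag_influential(example_d))
```

The flagged indices are candidates for a closer look, not automatic deletions.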

**Leverage vs. Residuals Plot**

A leverage vs. residuals plot can also help identify influential points.

Interpretation:
- Points with high leverage and large residuals may be influential and should be checked for potential data issues.
- The red dashed line marks a residual of zero; points far from it in either direction have large residuals.

In [None]:
# Get leverage and residuals
leverage = influence.hat_matrix_diag
residuals = nb_model.resid_pearson

# Plot leverage vs residuals
plt.scatter(leverage, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.title("Leverage vs. Residuals")
plt.xlabel("Leverage")
plt.ylabel("Residuals")
plt.show()

### **Zero-Inflated Negative Binomial (ZINB)**
In many count datasets, we may observe excessive zeros in the outcome variable. A standard Negative Binomial model might not handle this well, as it assumes the count variable follows a particular distribution that may not fit the excess zeros.

The Zero-Inflated Negative Binomial (ZINB) model is specifically designed to handle datasets with an overrepresentation of zeros.

##### ZINB Model Structure:
- Zero-Inflation Component: A logistic regression (binary model) that models the probability of an excess zero.
    - Looks at whether an observation belongs to the "always zero" group (for example, a person who never visits the hospital).
- Count Component: A Negative Binomial model for the counts of the remaining observations.
    - This component can itself produce zeros, so an observed zero may come from either component.
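
To make the two-part structure concrete, the sketch below writes out the ZINB probability mass function by hand. It assumes the NB2 parameterization with mean `mu` and dispersion `alpha`, plus an inflation probability `pi`; all names and values are illustrative and do not come from the fitted model.

```python
import math

def nb_pmf(k, mu, alpha):
    """NB2 probability of count k with mean mu and dispersion alpha (variance mu + alpha*mu^2)."""
    r = 1.0 / alpha
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    return math.exp(log_p)

def zinb_pmf(k, mu, alpha, pi):
    """ZINB probability: an excess zero with probability pi, otherwise an NB2 draw."""
    nb_part = (1.0 - pi) * nb_pmf(k, mu, alpha)
    return pi + nb_part if k == 0 else nb_part

# Hypothetical parameter values
mu, alpha, pi = 2.0, 0.5, 0.3
print(f"P(Y=0) under NB:   {nb_pmf(0, mu, alpha):.3f}")
print(f"P(Y=0) under ZINB: {zinb_pmf(0, mu, alpha, pi):.3f}")
```

The zero probability is the only place where the two components add together; for every positive count the NB probability is simply scaled by (1 − pi).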

##### Fitting a Zero-Inflated Negative Binomial (ZINB) Model
- We can fit a ZINB model in Python using the statsmodels library:

In [None]:
import statsmodels.api as sm
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

# Outcome and design matrix (with intercept) for the count component
y = df['HospitalVisits']
X = sm.add_constant(df[['Age', 'PreviousVisits', 'ChronicConditions']])

# Fit a Zero-Inflated Negative Binomial Model (ZINB)
# exog_infl specifies the predictors of the logistic zero-inflation component
# (here the same predictors as the count component; an intercept-only choice is also common).
zinb_model = ZeroInflatedNegativeBinomialP(y, X, exog_infl=X,
                                           inflation='logit', p=2).fit(method='bfgs',
                                                                       maxiter=500,
                                                                       disp=False)

# Model summary
print(zinb_model.summary())

##### Diagnostics for ZINB
Once the ZINB model is fitted, you can apply similar residual and influence diagnostics, as far as the fitted results object provides them:
- Pearson residuals (and Deviance residuals, if available) for the combined model.
- Cook's Distance and Leverage for influential data points.

Residual Analysis:
- Plot Pearson and Deviance residuals.
- Check normality (QQ plot) and patterns (histogram).

Influence Diagnostics:
- Identify influential points using Cook’s Distance and Leverage.

Zero-Inflated Models:
- Consider Zero-Inflated Negative Binomial (ZINB) if there are excessive zeros in the data.
- Evaluate model fit with Log-Likelihood comparison between Negative Binomial and ZINB.

##### **Comparison Between ZINB and NB Models**
You can compare the ZINB model with the standard Negative Binomial model to see if the zero-inflation component significantly improves the fit.

**Compare Log-Likelihoods**
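- The NB model is a special case of the ZINB model, so the ZINB log-likelihood will generally be at least as high. What matters is whether the improvement is large enough to justify the extra zero-inflation parameters, which AIC or BIC can help judge.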

In [None]:
# Compare Log-Likelihoods
print(f"Negative Binomial Log-Likelihood: {nb_model.llf}")
print(f"Zero-Inflated NB Log-Likelihood: {zinb_model.llf}")

### Analyzing Predictive Performance of the Negative Binomial and Zero-Inflated Models
To evaluate the predictive performance of a Negative Binomial (NB) model or Zero-Inflated Negative Binomial (ZINB) model, we rely on a range of metrics that help us understand how well the model is making predictions. Common metrics include 
- Root Mean Squared Error (RMSE), 
- Akaike Information Criterion (AIC), and 
- log-likelihood.

We'll explain their significance, and show how to calculate them in Python.

##### **Root Mean Squared Error (RMSE)**
RMSE is a standard metric used to assess how well a model's predictions match the actual observed values. It gives a sense of how far off the predictions are from the true values on average, with lower values indicating better performance.

The formula for RMSE is:
$$ RMSE = \sqrt{\frac{1}{n} \sum^n_{i=1} (y_i - \hat{y_i})^2}$$
- Where:
    - $y_i$ = actual values
    - $ \hat{y_i}$ = predicted values
    - n = number of observations

Calculating RMSE in Python
- First, let's assume you already have the fitted model (nb_model for Negative Binomial or zinb_model for Zero-Inflated Negative Binomial) and the actual observed counts (the outcome column of df that the model was fitted on).

Interpretation: 
- The RMSE provides an idea of how much error there is between the predicted and actual values. Smaller RMSE values indicate better model performance. For count data, the RMSE is useful for evaluating how close the predicted counts are to the observed counts.

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error

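# Actual values (hospital visits), taken from the fitted model so they match
# the outcome column and rows that were used for estimation
y_true = nb_model.model.endog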

# Predicted values (model predictions)
y_pred = nb_model.predict()  # Or zinb_model.predict()

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_true, y_pred)

# Calculate RMSE
rmse = np.sqrt(mse)
print(f"RMSE: {rmse}")

##### **Akaike Information Criterion (AIC)**
AIC is a metric used to compare different models based on their likelihood and the number of parameters. It helps assess model fit while penalizing for overfitting. Lower AIC values indicate a better-fitting model.

The formula for AIC is:
$$ AIC = 2k - 2\ln(\hat{L})$$
- Where:
    - $k$ = number of estimated model parameters
    - $\hat{L}$ = maximized value of the model's likelihood function

Calculating AIC in Python
- You can directly get the AIC from the model summary. However, for demonstration, we’ll calculate it manually using the log-likelihood and the number of parameters.

Interpretation: 
- Lower AIC values indicate a model that fits the data well without overfitting. You can compare AIC across different models (e.g., Negative Binomial vs. Zero-Inflated Negative Binomial) to choose the best one.

In [None]:
# Log-Likelihood of the fitted model
log_likelihood = nb_model.llf  # Or zinb_model.llf

# Number of parameters in the model
num_params = len(nb_model.params)  # Or zinb_model.params

# Calculate AIC
aic = 2 * num_params - 2 * log_likelihood
print(f"AIC: {aic}")

##### **Log-Likelihood (LL)**
The log-likelihood measures how well the model fits the data. It is based on the likelihood function and is often used in conjunction with AIC for model comparison.

Log-Likelihood in Python
- You can obtain the log-likelihood directly from the fitted model:

Interpretation: 
- Higher log-likelihood values indicate a better fit. However, the log-likelihood should not be compared across models with different numbers of parameters unless corrected by AIC or BIC.

In [None]:
log_likelihood = nb_model.llf  # Or zinb_model.llf
print(f"Log-Likelihood: {log_likelihood}")
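
Since the interpretation above mentions BIC alongside AIC, the sketch below computes both from a log-likelihood, a parameter count, and a sample size. The helper name and the numbers for the two models are hypothetical and are chosen only to show that the two criteria can disagree.

```python
import numpy as np

def information_criteria(log_likelihood, num_params, num_obs):
    """Return (AIC, BIC); lower values indicate a better trade-off of fit and complexity."""
    aic = 2 * num_params - 2 * log_likelihood
    bic = num_params * np.log(num_obs) - 2 * log_likelihood
    return aic, bic

# Hypothetical results: model B fits slightly better but uses twice as many parameters
aic_a, bic_a = information_criteria(log_likelihood=-520.0, num_params=4, num_obs=300)
aic_b, bic_b = information_criteria(log_likelihood=-515.0, num_params=8, num_obs=300)

print(f"Model A - AIC: {aic_a:.1f}, BIC: {bic_a:.1f}")
print(f"Model B - AIC: {aic_b:.1f}, BIC: {bic_b:.1f}")
```

In this made-up case AIC favors the larger model while BIC, which penalizes parameters more heavily as the sample grows, favors the smaller one.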

### Comparing Predictive Performance Using RMSE, AIC, and Log-Likelihood
Once you've calculated RMSE, AIC, and log-likelihood, you can compare different models to decide which one is better at predicting the outcome.

Example Comparison of Negative Binomial and Zero-Inflated Negative Binomial
Let's say you have two models: nb_model (Negative Binomial) and zinb_model (Zero-Inflated Negative Binomial). Here’s how to compare them:

Interpretation:
- RMSE: Lower RMSE means the model's predictions are closer to the actual data.
- AIC: The model with the lowest AIC is generally considered the best fit.
- Log-Likelihood: Higher log-likelihood indicates better model fit.


In [None]:
# Get predictions from both models
y_pred_nb = nb_model.predict()
y_pred_zinb = zinb_model.predict()

# Calculate RMSE for both models
rmse_nb = np.sqrt(mean_squared_error(y_true, y_pred_nb))
rmse_zinb = np.sqrt(mean_squared_error(y_true, y_pred_zinb))

# AIC for both models
aic_nb = 2 * len(nb_model.params) - 2 * nb_model.llf
aic_zinb = 2 * len(zinb_model.params) - 2 * zinb_model.llf

# Log-Likelihood for both models
log_likelihood_nb = nb_model.llf
log_likelihood_zinb = zinb_model.llf

# Print the results
print(f"Negative Binomial - RMSE: {rmse_nb}, AIC: {aic_nb}, Log-Likelihood: {log_likelihood_nb}")
print(f"Zero-Inflated Negative Binomial - RMSE: {rmse_zinb}, AIC: {aic_zinb}, Log-Likelihood: {log_likelihood_zinb}")

### **4.5. Gamma Regression for Predicting Transaction Processing Time**
Gamma regression is used for modeling positive continuous response variables, particularly when the variance increases with the mean (heteroscedasticity). This makes it suitable for predicting time taken to process a transaction, where larger transactions might have more processing time variability.

##### **Understanding Gamma Regression**

Why Gamma Regression?
- It is used when the dependent variable is continuous and strictly positive.
- The variance of the response variable increases with the mean (for the Gamma distribution, the variance is proportional to the square of the mean).
- The Gamma distribution is right-skewed, making it suitable for processing time data.

Model Form
$$ E(Y|X) = g^{-1}(X\beta)$$
- where:
    - Y = processing time (response variable)
    - $g^{-1}$: inverse of the link function (commonly log)
    - $X\beta$: linear predictor (features)
    - The response variable follows a Gamma distribution.

##### Key Assumptions
- Gamma-distributed response variable (positive, right-skewed).
- Mean-variance relationship: The variance of Y increases with the mean.
- Link function: Often log link, ensuring predicted values remain positive.

##### Data Preparation
Assume we have a dataset df containing:
- transaction_size (amount in USD)
- num_steps (number of steps in processing)
- processing_time (time in seconds)

If processing_time is right-skewed, Gamma regression is appropriate.

In [None]:
# Step 1: Checking the Data
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv("transaction_data.csv")  # Hypothetical dataset

# Summary
print(df.describe())

# Check skewness
sns.histplot(df['processing_time'], bins=30, kde=True)
plt.title("Distribution of Processing Time")
plt.show()

# Step 2: Fitting a Gamma Regression Model
# We use GLM (Generalized Linear Model) with a Gamma family and a log link function.
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Define the Gamma regression model
# (recent statsmodels versions name the link class `Log`; older releases used lowercase `log`)
gamma_model = smf.glm("processing_time ~ transaction_size + num_steps",
                       data=df, family=sm.families.Gamma(link=sm.families.links.Log())).fit()

# Model summary
print(gamma_model.summary())

# transaction_size and num_steps are predictors.

# Step 3: Model Diagnostics
# 1. Residual Analysis
# Gamma regression assumes a mean-variance relationship. Checking residuals ensures model appropriateness.
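# Plot response residuals against fitted values with a LOWESS trend line
# (regplot shows the residuals themselves; residplot would first re-fit a regression)
sns.regplot(x=gamma_model.fittedvalues, y=gamma_model.resid_response, lowess=True)
plt.axhline(0, color='r', linestyle='--')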
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()
# If residuals show a clear pattern, transformation or an alternative model may be needed.

# 2. Deviance Residuals
# Deviance residuals
sns.histplot(gamma_model.resid_deviance, bins=30, kde=True)
plt.title("Histogram of Deviance Residuals")
plt.show()
# Normality is not required, but extreme skewness may indicate overdispersion.

# Step 4: Model Evaluation
# 1. Root Mean Squared Error (RMSE)
# Measures the average prediction error.
import numpy as np
from sklearn.metrics import mean_squared_error

# Actual vs Predicted
y_true = df['processing_time']
y_pred = gamma_model.predict()

# Compute RMSE
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(f"RMSE: {rmse}")
# Lower RMSE → Better predictions.

# 2. Akaike Information Criterion (AIC)
# AIC balances goodness of fit with model complexity.
aic = gamma_model.aic
print(f"AIC: {aic}")
# Lower AIC → Better model.

# 3. Log-Likelihood
# Measures model fit.
log_likelihood = gamma_model.llf
print(f"Log-Likelihood: {log_likelihood}")
# Higher log-likelihood → Better model.

##### Step 5: Comparison with Alternative Models
If the Gamma model shows issues, alternative models include:

Log-Transformed OLS:
- If variance does not follow a Gamma structure, log-transforming processing_time and using linear regression may work.

Inverse Gaussian Regression:
- Suitable for positive skewed data with heavy tails. 

##### Comparing RMSE and AIC Across Models
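- Prefer the model with the lower RMSE and AIC, making sure both are computed on the same scale of the response. A model fitted to log(processing_time) has its likelihood (and therefore its AIC) on the log scale, so it is not directly comparable with the Gamma model's AIC without an adjustment.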

In [None]:
# Step 5: Comparison with Alternative Models

# Alternative: Log-Transformed Linear Model
df['log_processing_time'] = np.log(df['processing_time'])
lm_model = smf.ols("log_processing_time ~ transaction_size + num_steps", data=df).fit()

# Compute RMSE
rmse_lm = np.sqrt(mean_squared_error(y_true, np.exp(lm_model.predict())))

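# AIC values: the OLS AIC refers to log(processing_time). Adding the Jacobian
# term 2 * sum(log y) expresses it on the original scale, comparable with the Gamma AIC.
aic_lm = lm_model.aic + 2 * df['log_processing_time'].sum()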

# Print comparison
print(f"Gamma Model - RMSE: {rmse}, AIC: {aic}")
print(f"Log-Transformed OLS Model - RMSE: {rmse_lm}, AIC: {aic_lm}")

Interpreting Model Results
From the Gamma model summary:
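- Coefficients (β values) indicate how the log of the expected processing time changes with each predictor (because of the log link).
- Exp(β) gives the multiplicative change in expected processing time per unit increase of a predictor; (exp(β) − 1) × 100 is the corresponding percentage change.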

Interpretation
- If transaction_size has exp(β) = 1.05, a 1-unit increase increases processing time by 5%.

In [None]:
np.exp(gamma_model.params)
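
A small helper can turn a log-link coefficient into a percentage change. The function name and the example coefficient below are hypothetical; with a fitted model one would pass its estimated coefficients instead.

```python
import numpy as np

def percent_change(beta):
    """Percentage change in the expected response per unit increase, for a log-link coefficient."""
    return (np.exp(beta) - 1.0) * 100.0

# Hypothetical coefficient chosen so that exp(beta) = 1.05
beta_example = np.log(1.05)
print(f"beta = {beta_example:.4f}, percent change = {percent_change(beta_example):.2f}%")
```

For small coefficients the percentage change is close to 100 × β, but the two diverge as β grows, so the exponentiated form is the safer one to report.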

# Classification via Mathematical Functions

Classification Using the Equation of a Straight Line

Steps:

1. Begin with the Equation of a Line: The general equation of a straight line in a 2D plane is:

$$𝑦 = 𝑚 \times 𝑥 + 𝑐 $$

- m: Slope of the line (how steep it is)
- c: Intercept (where the line crosses the y-axis)

2. Connect it to Classification:
- In binary classification, the goal is to separate two classes (e.g., Class 0 and Class 1).
- The equation of a line can act as a decision boundary: 
    - points on one side of the line belong to Class 0, while 
    - points on the other side belong to Class 1.

3. Interactive Example: 
- Imagine a dataset with two features, $𝑥_1$ and $𝑥_2$

For simplicity:
- $𝑥_1$: Horizontal axis
- $𝑥_2$: Vertical axis

A simple decision boundary can be represented as:
$$ 𝑥_2 = 𝑚 \times 𝑥_1 + 𝑐 $$

**How the Slope (m) and Intercept (c) Influence the Boundary**

The slope and intercept determine the orientation and position of the decision boundary in the feature space.

- Slope (m):
    - Controls the steepness or angle of the line.
    - A larger absolute value of m means the line is steeper; a smaller absolute value means it is flatter.
    - Example: In $ 𝑥_2 = 𝑚 \times 𝑥_1 + 𝑐 $
        - If m > 0, the line slopes upward.
        - If m < 0, the line slopes downward.
        - If m = 0, the line is horizontal.
- Intercept (c):
    - Determines where the line crosses the $𝑥_2$ (vertical) axis.
    - Changing c shifts the line up or down without changing its slope.
    - Example: If c = 1, the line crosses the $x_2$ axis at 1.

Together, m and c define how the decision boundary separates the feature space. Adjusting these values can change which points fall into Class 0 or Class 1.

4. Decision Boundary in Classification: Modify the equation to reflect classification logic:
$$ 𝑥_2 - 𝑚 \times 𝑥_1 - 𝑐 = 0 $$

- Points where this equation equals 0 lie exactly on the line.
- Points where $ 𝑥_2 - 𝑚 \times 𝑥_1 - 𝑐 > 0 $ belong to Class 1.
- Points where $ 𝑥_2 - 𝑚 \times 𝑥_1 - 𝑐 < 0 $ belong to Class 0.

5. Visualization: Plot this line on a 2D plane with some example data points:
- Red points for Class 0
- Blue points for Class 1
- The line $x_2 = m \times x_1 + c$ separates the two classes

6. Extend to Higher Dimensions: In higher dimensions, the decision boundary becomes a hyperplane:

$$ w_1𝑥_1 + w_2𝑥_2 + ... + w_n𝑥_n + b = 0 $$

- Where: 
    - $w_1, w_2, ..., w_n$ are weights (equivalent to slopes) and
    - $𝑏$ is the intercept.
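
The decision rule from the steps above can be sketched in a few lines. The function name and the example points are hypothetical; the 2D line is written as the special case w = [-m, 1], b = -c of the hyperplane form, and points exactly on the boundary are assigned to Class 0 here purely by convention.

```python
import numpy as np

def classify_hyperplane(X, w, b):
    """Assign class 1 where w.x + b > 0, else class 0."""
    scores = np.asarray(X, dtype=float) @ np.asarray(w, dtype=float) + b
    return (scores > 0).astype(int)

# The 2D line x2 = m*x1 + c is the special case w = [-m, 1], b = -c
m, c = 2.0, 0.0
w_line, b_line = np.array([-m, 1.0]), -c

points = np.array([[0.0, 1.0],   # above the line
                   [1.0, 1.0],   # below the line
                   [0.5, 1.0]])  # exactly on the line
labels = classify_hyperplane(points, w_line, b_line)
print(labels)
```

The same function works unchanged for any number of features, since only the length of `w` changes.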

**What Happens When the Data Points Overlap Significantly?**

When data points from different classes overlap, the decision boundary may not cleanly separate the two classes, leading to misclassification. Here’s what happens:

Misclassification:
- Points from one class appear on the "wrong" side of the decision boundary.
- This results in a classification error (false positives or false negatives).

Impact on Model:
- A linear decision boundary (a straight line) may not be flexible enough to separate overlapping or complex distributions.
- Performance metrics like accuracy, precision, and recall can degrade.

Example Scenario:
- Consider a dataset where the two classes form concentric circles. A straight-line boundary cannot separate the classes, leading to significant misclassification.

**Transition from Linear to Non-Linear Decision Boundaries**

Linear decision boundaries work well when data is linearly separable. However, real-world data is often complex, requiring non-linear boundaries. Here’s how we transition:

1. Extend the Feature Space:
- Use techniques like polynomial features to introduce non-linear relationships.
    - where $x_1$ and $x_2$ can be transformed to:
        - $x^2_1$ and $x^2_2$
        - $x_1 \times x_2$
- The linear classifier now operates in this transformed space, creating a non-linear boundary in the original feature space.

2. Kernel Methods (e.g., in SVMs):
- Apply kernel functions like RBF (Radial Basis Function) to map data into a higher-dimensional space where it is linearly separable.
- The decision boundary in the original space appears non-linear.

3. Neural Networks:
- Multi-layer perceptrons (MLPs) can learn complex, non-linear decision boundaries by stacking layers of non-linear activation functions.
- Neural networks are particularly powerful for high-dimensional and unstructured data.

4. Ensemble Models:
- Techniques like random forests or gradient boosting combine multiple weak learners to create flexible decision boundaries.
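
The feature-space extension described above can be tried on concentric circles. The sketch below assumes scikit-learn is available and uses logistic regression as the linear classifier, which is a choice made here for illustration; it compares raw features with degree-2 polynomial features on the training data.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two concentric rings: not separable by a straight line in (x1, x2)
X_circ, y_circ = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=0)

# Linear classifier on the raw features
linear_clf = LogisticRegression().fit(X_circ, y_circ)

# Same classifier after adding x1^2, x2^2 and x1*x2
poly_clf = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                         LogisticRegression(max_iter=1000)).fit(X_circ, y_circ)

linear_acc = linear_clf.score(X_circ, y_circ)
poly_acc = poly_clf.score(X_circ, y_circ)
print(f"Raw features accuracy:      {linear_acc:.2f}")
print(f"Degree-2 features accuracy: {poly_acc:.2f}")
```

The classifier is still linear in the expanded features; the squared terms let it express a circular boundary in the original two dimensions.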

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Example data
np.random.seed(0)
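# Class 0 lies below the line x2 = 2*x1 and Class 1 above it,
# matching the rule that x2 - m*x1 - c > 0 means Class 1
x1_class0 = np.random.rand(50)
x2_class0 = 2 * x1_class0 - 0.5 + np.random.normal(0, 0.1, 50)
x1_class1 = np.random.rand(50)
x2_class1 = 2 * x1_class1 + 0.5 + np.random.normal(0, 0.1, 50)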

# Equation of line: x2 = m*x1 + c
m = 2  # slope
c = 0  # intercept
x_line = np.linspace(0, 1, 100)
y_line = m * x_line + c

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(x1_class0, x2_class0, color='red', label='Class 0')
plt.scatter(x1_class1, x2_class1, color='blue', label='Class 1')
plt.plot(x_line, y_line, color='black', label='Decision Boundary')
plt.xlabel('x1')
plt.ylabel('x2')
plt.legend()
plt.title('Linear Decision Boundary for Classification')
plt.grid()
plt.show()

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Generate non-linear dataset
X, y = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=0)

# Plot raw data
plt.figure(figsize=(6, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='red', label='Class 0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='blue', label='Class 1')
plt.title('Non-linear Data')
plt.legend()
plt.show()

# Linear decision boundary (fails for non-linear data)
linear_svm = SVC(kernel='linear', C=1)
linear_svm.fit(X, y)

# Non-linear decision boundary using kernel trick
nonlinear_svm = SVC(kernel='rbf', C=1, gamma=2)
nonlinear_svm.fit(X, y)

# Visualize decision boundaries
def plot_decision_boundary(clf, X, y, title):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.8, cmap='coolwarm')
    plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='red', edgecolor='k', label='Class 0')
    plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='blue', edgecolor='k', label='Class 1')
    plt.title(title)
    plt.legend()
    plt.show()

# Linear decision boundary
plot_decision_boundary(linear_svm, X, y, title='Linear Decision Boundary (Fails)')

# Non-linear decision boundary
plot_decision_boundary(nonlinear_svm, X, y, title='Non-linear Decision Boundary (Succeeds)')


### **Linear Discriminant Analysis (LDA)**
Linear Discriminant Analysis (LDA) is a classification technique that uses a linear combination of features to separate classes. 

It assumes:
- The data within each class is normally distributed.
- The covariance of each class is identical (homoscedasticity).

LDA works by finding a linear decision boundary that maximizes the separation between classes.

Goal:
- Project data onto a lower-dimensional space (usually 1D for binary classification).
- Maximize the distance between class means while minimizing the variance within each class.

**Calculating the best values for the parameters of a linear discriminant**
- Estimate the coefficients that define the linear decision boundary based on your dataset.
- In Linear Discriminant Analysis (LDA), these coefficients are derived by maximizing the separation between the means of the classes while minimizing the variance within each class.

Steps:
1. Define the Linear Discriminant Function

The linear discriminant function for binary classification can be written as:

$$ y = w_0 + w_1𝑥_1 + w_2𝑥_2 + ... + w_d𝑥_d $$

- where:
    - $w_0$: Intercept (bias term).
    - $w_1, w_2, ..., w_d$: Coefficients for each feature $x_1, x_2, ..., x_d$
    - y: The decision score. A threshold is applied to classify points.

2. Estimate Class Statistics

To compute the parameters, you first need the following statistics from the data:

- Compute Class Means ($\mu_0$ and $\mu_1$): Calculate the mean vector for each class.
    - for each class $C_k$ ($C_0$ and $C_1$):
$$ \mu_k = \frac{1}{N_k} \sum_{x \in C_k} x $$
- where 
    - $N_k$ is the number of instances in class k.

- Compute Pooled Covariance Matrix ($S_w$):
    - Within-Class Scatter Matrix ($S_w$): Measures the spread of points within each class.
$$ S_w = \sum_{i = 1}^{c} \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T $$
    - Here c is the number of classes. Divide by the total number of samples (N), or by $N - c$ for the unbiased estimate, to get the pooled covariance matrix.
- Compute Between-Class Scatter Matrix ($S_b$): Measures the separation between class means.
$$ S_b = \sum_{i = 1}^{c} N_i (\mu_i - \mu)(\mu_i - \mu)^T $$
- where 
    - $N_i$ is the number of samples in class i
    - $\mu$ is the overall mean.

- Prior Probabilities (P($C_0$) and P($C_1$)):
    - These are the proportions of each class in the dataset:
$$ P(C_k) = \frac{N_k}{N} $$

3. Compute the Parameters

- Find Optimal Projection: Solve the eigenvalue problem for $S^{-1}_w S_b$, and select the eigenvector with the largest eigenvalue. For two classes, this eigenvector is proportional to $S^{-1}_w (\mu_1 - \mu_0)$, so the parameters of the discriminant function have a closed form:
    - Linear Coefficients ($w$):
$$ w = S^{-1}_w (\mu_1 - \mu_0)$$
    - Intercept ($w_0$):
$$ w_0 = -\frac{1}{2} (\mu^T_1 S^{-1}_w \mu_1 - \mu^T_0 S^{-1}_w \mu_0 ) + \ln \frac{P(C_1)}{P(C_0)}$$
- where:
    - $S^{-1}_w$ is the inverse of the pooled covariance matrix.

4. Predict Class Labels (Decision Rule)

- Project data points onto the linear discriminant.
    - For a new instance 𝑥, compute the linear discriminant score:
$$ y = w_0 + w^T x$$

- Use a threshold (e.g., midpoint between means) to classify points.
- Classify based on the threshold (usually $y> 0 \Rightarrow C_1 , y \leq 0 \Rightarrow C_0$ )

**Interpretation of Parameters**

Linear Coefficients (𝑤):
- What does a feature weight represent?
    - A feature weight (coefficient) indicates the change in the predicted outcome associated with a unit change in the feature, keeping all other features constant.
    - These determine how much each feature contributes to the decision boundary.

- A larger magnitude of $𝑤_𝑖$ means the corresponding feature $𝑥_𝑖$ has more influence.
    - Interpreting the Magnitude (Absolute Value):
        - Larger Magnitudes: Indicate that a feature has a stronger effect on the outcome.
        - Smaller Magnitudes: Suggest that the feature has less influence on the outcome.
    - Interpreting the Sign (Positive or Negative):
        - Positive Weight: Indicates a positive relationship between the feature and the outcome.
        - Negative Weight: Indicates a negative relationship between the feature and the outcome.

- Impact of Scaling on Magnitudes
    - Coefficient magnitudes are comparable only if the features are on the same scale. If features differ in scale:
        - A feature measured in large numbers (e.g., metres rather than kilometres) receives a proportionally smaller coefficient, so coefficient size reflects the units as much as the feature's importance.
        - Standardizing or normalizing the features (e.g., using z-scores or min-max scaling) ensures that coefficient magnitudes are comparable.
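
The dependence of a coefficient on the feature's units can be seen in a small simulation. The data, the variable names, and the `ols_slope` helper below are hypothetical; the same feature is expressed once in kilometres and once in metres.

```python
import numpy as np

rng = np.random.default_rng(0)
distance_km = rng.uniform(1, 10, size=200)                   # hypothetical feature in kilometres
outcome = 3.0 * distance_km + rng.normal(0, 0.5, size=200)   # outcome depends on distance

def ols_slope(x, y):
    """Slope of a simple least-squares fit y ~ a + b*x."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]

slope_km = ols_slope(distance_km, outcome)
slope_m = ols_slope(distance_km * 1000, outcome)   # same feature expressed in metres
print(f"Slope per km: {slope_km:.4f}")
print(f"Slope per m:  {slope_m:.7f}")
```

The fitted relationship is identical in both cases; only the unit changes, and the coefficient changes by the same factor of 1000.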

Linear Regression Example
- Model: Predict house price (y) using square footage ($𝑥_1$) and number of bedrooms ($𝑥_2$):

$$ y = w_0 + w_1𝑥_1 + w_2𝑥_2 $$
$$ y = 50 + 300𝑥_1 + 10000𝑥_2 $$

- Interpretation:
    - $w_1$ = 300: Increasing square footage by 1 unit increases the house price by R300.
    - $w_2$ = 10,000: Adding one bedroom increases the house price by R10,000.

1. Interpretation of Weight Magnitude in Logistic Regression
- In logistic regression, a weight does not represent the change in the outcome itself but the change in the log-odds of the outcome per unit change in the feature.

$$ log(\frac{P(y = 1)}{P(y = 0)}) = w_0 + w_1𝑥_1 + w_2𝑥_2 + ... + w_d𝑥_d $$

- Exponentiated coefficients ($e^{w_i}$) indicate the multiplicative effect on the odds for a unit change in $x_i$

Logistic Regression Example
- Model: Predict customer churn (y) based on monthly charges ($𝑥_1$) and contract length ($𝑥_2$):

$$ log(\frac{P(y = 1)}{P(y = 0)}) = w_0 + w_1𝑥_1 + w_2𝑥_2 $$
$$ log(\frac{P(y = 1)}{P(y = 0)}) = -3 + 0.05𝑥_1 + 2𝑥_2 $$

- Interpretation:
    - $w_1$ = 0.05: for every R1 increase in monthly charges, the log-odds of churn increase by 0.05.
    - $w_2$ = 2: For each additional month of contract length, the log-odds of churn increase by 2.
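
The churn equation above can be evaluated directly to move between log-odds, odds ratios, and probabilities. The function name and the input values are illustrative; the coefficients are the ones from the example equation.

```python
import math

def churn_probability(monthly_charges, contract_length, w0=-3.0, w1=0.05, w2=2.0):
    """Probability of churn from the example log-odds equation."""
    log_odds = w0 + w1 * monthly_charges + w2 * contract_length
    return 1.0 / (1.0 + math.exp(-log_odds))

# Odds ratio for a one-unit increase in monthly charges
odds_ratio_charges = math.exp(0.05)
print(f"Odds ratio per unit of monthly charges: {odds_ratio_charges:.4f}")

# Hypothetical customer: monthly charges of 20 and contract length of 1
p_example = churn_probability(20, 1)
print(f"Predicted churn probability: {p_example:.2f}")
```

For this customer the log-odds are -3 + 0.05 × 20 + 2 × 1 = 0, which corresponds to a probability of one half. The same change in log-odds shifts the probability by different amounts depending on where on the curve the customer starts.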

2. Interpretation of Weight Magnitude in Regularized Models (Lasso and Ridge)
- Coefficients may be shrunk or set to zero based on regularization strength, which impacts their magnitude.
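- Regularization shrinks coefficients toward zero, so a feature that keeps a large weight (on standardized features) is more likely to matter, although this is not a guarantee of importance.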

Intercept ($𝑤_0$):
- Adjusts the position of the decision boundary.

Decision Rule:
- $y>0$: Class 1.
- $y≤0$: Class 0.

Considerations for interpretations

Multicollinearity:
- If features are highly correlated, the magnitude of weights can become unstable and misleading.
    - Techniques like Variance Inflation Factor (VIF) or regularization can mitigate this.

Standardization:
- Always standardize features to ensure meaningful comparisons between coefficients.

Model-Specific Meaning:
- Interpretations vary slightly across linear regression, logistic regression, and other models.
    - In logistic regression, remember that coefficients affect the log-odds, not the raw probabilities.

**Interpretation of Results**

Confusion Matrix and Classification Report:
- The confusion matrix indicates true positives, true negatives, false positives, and false negatives.
- The classification report shows metrics like precision, recall, F1-score, and accuracy.

Decision Boundary:
- The plot shows the LDA decision boundary, which is linear. 
    - It separates the two classes by maximizing the ratio of between-class variance to within-class variance.
- Data points on either side of the boundary are classified into their respective classes.

**Assumptions and Limitations:**

Assumptions:
- Classes have a normal distribution.
- Classes share the same covariance matrix.

Limitations:
- LDA struggles with non-linear boundaries or when the assumptions of normality and homoscedasticity are violated.

**When to Use LDA**

Advantages:
- Works well when the data satisfies its assumptions.
- Provides interpretable results with clear decision boundaries.

Use Cases:
- Medical diagnosis (e.g., distinguishing between disease states).
- Marketing (e.g., classifying customer preferences).
- Text classification (when transformed into vector space).

Not Suitable:
- When classes are non-linearly separable (use non-linear methods like quadratic discriminant analysis or kernel methods in such cases).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Generate synthetic data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2, n_redundant=0,
                           n_clusters_per_class=1, class_sep=2, random_state=42)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Visualize the data
plt.figure(figsize=(8, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='red', label='Class 0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='blue', label='Class 1')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Dataset')
plt.legend()
plt.show()

# Apply LDA
lda = LDA()
lda.fit(X_train, y_train)

# Predictions
y_pred = lda.predict(X_test)

# Model evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Visualize decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
Z = lda.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.8, cmap='coolwarm')
plt.scatter(X_test[y_test == 0][:, 0], X_test[y_test == 0][:, 1], color='red', edgecolor='k', label='Class 0')
plt.scatter(X_test[y_test == 1][:, 0], X_test[y_test == 1][:, 1], color='blue', edgecolor='k', label='Class 1')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('LDA Decision Boundary')
plt.legend()
plt.show()


In [None]:
import numpy as np

# Example data: two classes, each with 2 features
class_0 = np.array([[2, 3], [3, 4], [4, 5]])
class_1 = np.array([[6, 8], [7, 9], [8, 10]])

# Combine data and calculate class statistics
X = np.vstack([class_0, class_1])
y = np.array([0] * len(class_0) + [1] * len(class_1))

# Calculate class means
mu_0 = np.mean(class_0, axis=0)
mu_1 = np.mean(class_1, axis=0)

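# Calculate within-class scatter matrix
S_w = np.zeros((X.shape[1], X.shape[1]))
for xi in class_0:
    S_w += np.outer(xi - mu_0, xi - mu_0)
for xi in class_1:
    S_w += np.outer(xi - mu_1, xi - mu_1)

# Pooled covariance estimate: scatter divided by (n - number of classes)
Sigma = S_w / (len(X) - 2)

# Each class here lies exactly on a line, so Sigma is singular;
# the pseudo-inverse avoids the LinAlgError that np.linalg.inv would raise
Sigma_inv = np.linalg.pinv(Sigma)

# Calculate linear coefficients
w = Sigma_inv @ (mu_1 - mu_0)

# Calculate intercept
prior_0 = len(class_0) / len(X)
prior_1 = len(class_1) / len(X)
intercept = -0.5 * (mu_1 @ Sigma_inv @ mu_1 - mu_0 @ Sigma_inv @ mu_0) + np.log(prior_1 / prior_0)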

# Display results
print("Linear Coefficients (w):", w)
print("Intercept (w0):", intercept)

# Predict for a new sample
sample = np.array([5, 6])
decision_score = intercept + w.T.dot(sample)
prediction = 1 if decision_score > 0 else 0
print("Prediction for sample {}: Class {}".format(sample, prediction))

### Optimizing the Objective Function in a Linear Discriminant Model

The objective of a Linear Discriminant Analysis (LDA) model is to:
- Find a linear combination of features that best separates two or more classes. This is achieved by 
    - optimizing an objective function that 
        - maximizes the separation between classes while 
        - minimizing the spread (variance) within each class.

##### **Objective Function of LDA**
The objective function in LDA is based on two key matrices:
1. Between-Class Variance ($S_B$):
    - Measures the separation between the class means.
    - Defined as:
$$ S_B = \sum^k_{i=1} n_i (\mu_i - \mu)(\mu_i - \mu)^T $$

- where:
    - $k$: Number of classes.
    - $n_i$: Number of instances in class i.
    - $\mu_i$: Mean vector of class i.
    - $\mu$: Overall mean vector.

2. Within-Class Variance ($S_W$):
    - Measures the spread of data points within each class.
    - Defined as:
$$ S_W = \sum^k_{i=1} \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T $$

- where:
    - $C_i$ represents all instances belonging to class i.

The objective function to optimize in LDA is:

$$ J(w) = \frac{w^T S_B w}{w^T S_W w}$$

- Where:
    - w : is the weight vector that defines the linear discriminant

##### **Optimizing the Objective Function**
To maximize J(w):
1. Solve the generalized eigenvalue problem:
$$ S_W^{-1} S_B w = \lambda w $$

- Where:
    - $\lambda$ is the eigenvalue and 
    - w is the eigenvector.

2. Select the eigenvector corresponding to the largest eigenvalue $\lambda_1$,  as it maximizes the class separation.

3. For multiclass problems, select the top k-1 eigenvectors (for k classes) to project data into a lower-dimensional space with maximum discrimination. 

### Scoring and Ranking Instances
Once the linear discriminant function is computed, it can be used to score and rank instances as follows:

Scoring:
- The discriminant score for an instance x is calculated as:

$$ y = w^T x + b$$

- Where
    - w is the optimized weight vector.
    - b is the intercept (bias term).
    - y is the scalar discriminant score.

- The score indicates how far x lies from the decision boundary:
    - Positive scores suggest the instance is likely to belong to one class.
    - Negative scores suggest the instance is likely to belong to the other class.

Ranking:
- Instances can be ranked based on their discriminant scores y.
    - Larger absolute scores indicate greater confidence in classification.
    - Instances closer to zero are near the decision boundary, indicating uncertainty.

##### Practical Example

Given Dataset
Suppose you have two classes (Class A and Class B) and two features $x_1, x_2$

Steps to Optimize and Use the Objective Function:
1. Compute Class Means:
- $\mu_A$ and $\mu_B$ are the mean vectors for Class A and Class B.
- $\mu$ is the overall mean.

2. Compute Variance Matrices:
- Calculate $S_B, S_W$

3. Solve for w:
- Find the eigenvector corresponding to the largest eigenvalue of $S_W^{-1} S_B$.

4. Calculate Scores:
- For each instance $x_i$, compute the discriminant score:
$$ y_i = w^T x_i + b$$

5. Rank Instances:
- Sort instances by their discriminant scores to rank them by their likelihood of belonging to a specific class
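
The five steps can be sketched in NumPy as follows. This is a minimal illustration on hypothetical toy data, not a reference implementation. The intercept `b` is an assumption: it is chosen so that the midpoint between the two class means scores zero, which corresponds to equal class priors.

```python
import numpy as np

# Hypothetical toy data: two classes (A and B), two features each
X_a = np.array([[2.0, 3.0], [3.0, 5.0], [5.0, 7.0]])
X_b = np.array([[6.0, 8.0], [8.0, 10.0], [9.0, 12.0]])
X = np.vstack([X_a, X_b])

# Step 1: class means and overall mean
mu_a, mu_b, mu = X_a.mean(axis=0), X_b.mean(axis=0), X.mean(axis=0)

# Step 2: between-class and within-class scatter matrices
S_B = (len(X_a) * np.outer(mu_a - mu, mu_a - mu)
       + len(X_b) * np.outer(mu_b - mu, mu_b - mu))
S_W = (X_a - mu_a).T @ (X_a - mu_a) + (X_b - mu_b).T @ (X_b - mu_b)

# Step 3: eigenvector of S_W^{-1} S_B with the largest eigenvalue
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
if w @ (mu_b - mu_a) < 0:  # orient w so that Class B gets the higher scores
    w = -w

# Step 4: discriminant scores (assumed intercept: class-mean midpoint scores 0)
b = -w @ (mu_a + mu_b) / 2
scores = X @ w + b

# Step 5: rank instances from most B-like to most A-like
ranking = np.argsort(-scores)

print("w:", w)
print("scores:", scores)
print("ranking:", ranking)
```

For two classes, the leading eigenvector points in the same direction as $S_W^{-1}(\mu_B - \mu_A)$, so the eigen-decomposition could be replaced by a single linear solve.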

__________________

Note: The `decision_function()` method is provided by several classifiers in libraries like scikit-learn. It returns a signed score for each sample that is proportional to the sample's distance from the decision boundary in classification tasks.
- This function is particularly useful in models that rely on decision boundaries, such as 
    - Linear Discriminant Analysis (LDA), 
    - Support Vector Machines (SVM),
    - Logistic Regression.

What Does `decision_function()` Return?

Binary Classification (2 Classes):
- Returns a 1D array of scores where each score indicates the distance of the instance from the decision boundary.
    - Positive scores suggest one class (e.g., Class 1), and negative scores suggest the other class (e.g., Class 0).

Multiclass Classification (More than 2 Classes):
- Returns a 2D array of scores (one score per class for each instance).
    - The classifier assigns a class label based on the highest score.

Why Use decision_function()?
- To understand the confidence of predictions.
- To enable custom ranking or thresholding based on discriminant scores.
- To analyze how far instances are from the boundary, providing insight into borderline cases.

Use `decision_function()` in Ranking and Thresholding

- Ranking: Instances can be ranked by their scores. 
    - Higher absolute values indicate greater confidence in classification.
- Thresholding: The decision scores can be used to apply custom thresholds to refine classification decisions.

Decision Boundary

In the case of Linear Discriminant Analysis:
- The decision boundary corresponds to where the decision_function() outputs 0.
- This boundary is a hyperplane that separates the feature space into regions corresponding to each class.

In [None]:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Example data
X = np.array([[2, 3], [3, 5], [5, 7], [6, 8], [8, 10], [9, 12]])  # Features
y = np.array([0, 0, 0, 1, 1, 1])  # Labels (0: Class A, 1: Class B)

# Fit LDA model
lda = LDA()
lda.fit(X, y)

# Compute discriminant scores
scores = lda.decision_function(X)

# Print scores and rankings
print("Discriminant Scores:", scores)
print("Ranking of Instances:", np.argsort(-scores))  # Descending order


Interpretation of Scores and Rankings

Discriminant Scores:
- Positive scores suggest membership in Class 1.
- Negative scores suggest membership in Class 0.

Rankings:
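- `np.argsort(-scores)` orders instances from the highest score (most strongly Class 1) to the lowest score (most strongly Class 0).
- To rank by confidence regardless of class, sort by the absolute score instead, for example with `np.argsort(-np.abs(scores))`.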

Decision Boundary:
- The boundary is where the discriminant score y = 0
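
Custom thresholding and confidence-based ranking can be sketched as follows. The threshold value of 2.0 is a hypothetical choice for illustration; in practice it would be tuned to the cost of each type of error.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[2, 3], [3, 5], [5, 7], [6, 8], [8, 10], [9, 12]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis().fit(X, y)
scores = lda.decision_function(X)

# Default rule: predict Class 1 when the score is positive
default_pred = (scores > 0).astype(int)

# Custom rule: demand a stricter (hypothetical) threshold before predicting Class 1
threshold = 2.0
strict_pred = (scores > threshold).astype(int)

# Rank by confidence regardless of class: largest |score| first
confidence_rank = np.argsort(-np.abs(scores))

print("Default predictions:", default_pred)
print("Strict predictions: ", strict_pred)
print("Confidence ranking: ", confidence_rank)
```

Raising the threshold can only move instances from Class 1 to Class 0, which trades recall on Class 1 for precision.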

### Analyzing the relationship between the distance from the decision boundary of a linear discriminant and the likelihood of response
This analysis helps us understand how confident the model is in its predictions.

- The distance from the decision boundary (discriminant score) relates directly to the confidence in classification.
- Scores are transformed into posterior probabilities using logistic (binary) or softmax (multiclass) functions.
- These probabilities are interpretable as the likelihood of response and can be used for scoring, ranking, and applying thresholds for decision-making.

##### Theoretical Relationship : Distance from the decision boundary

The decision boundary in a Linear Discriminant Analysis (LDA) separates classes by 
- maximizing the distance between class means while 
- minimizing variance within each class. 

The discriminant score $y = w^T x + b$ is proportional to the signed distance of an instance x from the decision boundary (the geometric distance is $y / \|w\|$):
- Positive scores indicate the instance is classified into one class (e.g., Class 1).
- Negative scores indicate the instance is classified into the other class (e.g., Class 0).

The magnitude of the score reflects the confidence in classification:
- Larger absolute values imply that the instance is far from the decision boundary and thus more confidently classified.
- Smaller absolute values (near zero) indicate that the instance is close to the boundary, suggesting uncertainty.

##### Likelihood of Response
In LDA, we can link the discriminant score to the posterior probability of a class, which represents the likelihood of the instance belonging to that class:

$$ P(C_k | x) = \frac{\exp(y_k)}{\sum^K_{j=1} \exp(y_j)} $$

- Where: 
    - $P(C_k | x)$ is the posterior probability for class k.
    - $y_k = w_k^T x + b_k$ is the discriminant score for class k.
    - The denominator is the normalization factor across all K classes.

The posterior probability serves as a soft classification metric:
- Probabilities closer to 1 indicate high confidence.
- Probabilities closer to 0.5 (in a binary classification) indicate uncertainty.

#####  Practical Example
Let’s calculate the relationship between discriminant scores and posterior probabilities for a **Binary classification**.

Example Dataset

Suppose we have a binary classification problem with discriminant scores:
$$ y = [2.0, 0.5, 0.0, -0.5, -2.0] $$

We can compute the posterior probabilities using the logistic function:
$$ P(C_1 | x) = \frac{1}{1 + exp(-y)} $$

_______________

Generalization to **Multiclass Classification**
- In multiclass problems, the discriminant scores $𝑦_𝑘$ are normalized using the softmax function to compute posterior probabilities for each class:

$$ P(C_k | x) = \frac{\exp(y_k)}{\sum^K_{j=1} \exp(y_j)} $$

- And the class with the highest posterior probability is the predicted class.
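
A minimal sketch of the softmax step is shown here. The score matrix is hypothetical (3 instances, 3 classes), and subtracting the row maximum before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical discriminant scores: 3 instances (rows) x 3 classes (columns)
scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.0, -0.1],
                   [-1.0, 3.0, 0.5]])

posteriors = softmax(scores)
predicted = posteriors.argmax(axis=1)

print("Posterior probabilities:\n", posteriors)
print("Predicted classes:", predicted)
```

The second row has nearly equal scores, so its posteriors are all close to 1/3, the multiclass analogue of a binary probability near 0.5.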

In [None]:
import numpy as np

# Discriminant scores
scores = np.array([2.0, 0.5, 0.0, -0.5, -2.0])

# Compute posterior probabilities using the logistic function
posterior_probabilities = 1 / (1 + np.exp(-scores))

# Print results
print("Scores:", scores)
print("Posterior Probabilities:", posterior_probabilities)


Interpretation of Results

1. Scores Far from Zero:
- y=2.0: High confidence in Class 1 ($P(C_1∣x)=0.88$).
- y=−2.0: High confidence in Class 0 ($P(C_0∣x)=0.88$).

2. Scores Near Zero:
- y=0.0: The posterior probability is 0.5, indicating complete uncertainty.

3. Intermediate Scores:
- y=0.5: Moderately confident in Class 1 ($P(C_1∣x)=0.62$).
- y=−0.5: Moderately confident in Class 0 ($P(C_0∣x)=0.62$).

Insights

Distance and Likelihood:
- Instances farther from the boundary (large ∣y∣) have posterior probabilities close to 0 or 1, indicating higher confidence in classification.
- Instances near the boundary (y≈0) have probabilities close to 0.5, indicating uncertainty.

Scoring and Ranking:
- By sorting instances based on posterior probabilities, you can rank them in terms of likelihood of response (e.g., likelihood of belonging to Class 1).
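
The link between scores and probabilities can be checked numerically. The sketch below assumes scikit-learn's `LinearDiscriminantAnalysis`, whose two-class `predict_proba` is expected to match the logistic transform of `decision_function`; the toy data is hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[2, 3], [3, 5], [5, 7], [6, 8], [8, 10], [9, 12]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

lda = LinearDiscriminantAnalysis().fit(X, y)

# Discriminant scores and their logistic transform
scores = lda.decision_function(X)
sigmoid_of_scores = 1.0 / (1.0 + np.exp(-scores))

# Posterior probability of Class 1 as reported by the model
proba_class1 = lda.predict_proba(X)[:, 1]

print("Match:", np.allclose(sigmoid_of_scores, proba_class1))

# Rank instances by likelihood of belonging to Class 1
ranking = np.argsort(-proba_class1)
print("Ranking by P(Class 1):", ranking)
```

Because the logistic function is monotonic, ranking by posterior probability gives the same order as ranking by raw score.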

Visualization
To better understand the relationship, plot the discriminant score against the posterior probability:

In [None]:
import matplotlib.pyplot as plt

# Plot scores vs posterior probabilities
plt.plot(scores, posterior_probabilities, marker='o')
plt.axvline(0, color='gray', linestyle='--', label='Decision Boundary')
plt.title('Discriminant Score vs Posterior Probability')
plt.xlabel('Discriminant Score (y)')
plt.ylabel('Posterior Probability')
plt.legend(['Scores', 'Decision Boundary'])
plt.grid()
plt.show()

### Understanding Decision Boundaries in Depth

Decision boundaries are surfaces (lines, planes, or hypersurfaces) that separate data points into different classes in a feature space.
- These boundaries are derived based on the decision rules of a classifier, and they indicate the regions where the classifier predicts different outcomes.

##### Decision Boundaries in 2D
In 2D space, the decision boundary is a 
- line (for linear classifiers) or a 
- curve (for non-linear classifiers).

1. Linear Decision Boundaries
- For a binary classification problem, a linear decision boundary is represented as:

$$ w_0 + w_1𝑥_1 + w_2𝑥_2 = 0 $$

- where:
    - $ w_1, w_2$ are the coefficients.
    - $ 𝑥_1 , 𝑥_2$ are the features.
    - $ w_0 $  is the intercept.

For example, when separating red and blue points in 2D space, the decision boundary is a straight line: points on one side are assigned to one class, and points on the other side are assigned to the other class.

2. Non-Linear Decision Boundaries
- For complex data distributions, non-linear classifiers create curved decision boundaries. Examples include:
    - SVMs with the kernel trick
    - neural networks 
- Example: A circular boundary might separate inner and outer regions in a concentric circle dataset.

Visualization
- The boundary is typically visualized by plotting the equation in 2D space and showing the classification regions.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Generate 2D data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2, n_redundant=0, class_sep=1.5, random_state=42)
model = LogisticRegression()
model.fit(X, y)

# Create grid for decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Plot data and decision boundary
plt.contourf(xx, yy, Z, alpha=0.8, cmap='coolwarm')
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', cmap='coolwarm')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('2D Decision Boundary')
plt.show()

##### Decision Boundaries in 3D
In 3D space, the decision boundary becomes a 
- plane (for linear classifiers) or a 
- curved surface (for non-linear classifiers).

1. Linear Decision Boundaries
- Represented as:

$$ w_0 + w_1𝑥_1 + w_2𝑥_2 + w_3𝑥_3= 0 $$

- where:
    - $ w_1, w_2,  w_3 $ are the coefficients.
    - $ 𝑥_1 , 𝑥_2, 𝑥_3$ are the features.
    - $ w_0 $  is the intercept.

The plane separates the three-dimensional feature space into two regions, one for each class.

2. Non-Linear Decision Boundaries

- Non-linear models define curved surfaces in 3D space, such as
    - spheres or
    - paraboloids.
- Example: In 3D, the boundary might look like a bowl separating one region (inside the bowl) from another (outside the bowl).

Visualization
- Visualizing a plane or curved surface in 3D is possible with tools like Matplotlib's 3D plotting. It shows how the boundary divides the space.


In [None]:
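from mpl_toolkits.mplot3d import Axes3D  # only needed on older Matplotlib versions
from sklearn.svm import SVC

# Generate 3D data (seeded for reproducibility)
rng = np.random.default_rng(42)
X = rng.random((200, 3))
# Linear decision boundary that depends on all three features, so the fitted
# plane can be solved for the third coordinate when plotting
y = (X[:, 0] + X[:, 1] + X[:, 2] > 1.5).astype(int)
model = SVC(kernel='linear')
model.fit(X, y)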

# Create grid for decision boundary
xx, yy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
zz = (-model.intercept_[0] - model.coef_[0][0] * xx - model.coef_[0][1] * yy) / model.coef_[0][2]

# Plot data and decision boundary
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap='coolwarm', edgecolor='k')
ax.plot_surface(xx, yy, zz, alpha=0.5, color='gray')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_zlabel('Feature 3')
ax.set_title('3D Decision Boundary')
plt.show()

##### Decision Boundaries in Higher Dimensions
In higher-dimensional spaces, the decision boundary becomes a 
- hyperplane (for linear classifiers) or a 
- more complex hypersurface (for non-linear classifiers).

1. Linear Decision Boundaries
- For a d-dimensional feature space, the equation is:
$$ w_0 + w_1𝑥_1 + w_2𝑥_2 + w_3𝑥_3 + ... + w_d𝑥_d = 0 $$
- This hyperplane divides the d-dimensional space into regions for classification.

In a 4D feature space $𝑥_1, 𝑥_2, 𝑥_3, 𝑥_4$, the decision boundary is a 3D hyperplane.

2. Non-Linear Decision Boundaries
- Non-linear models use transformations (e.g., polynomial, kernel tricks) to create non-linear hypersurfaces.
- These hypersurfaces can separate data points that are non-linearly separable in their original feature space.

Visualization
- Direct visualization becomes challenging beyond 3 dimensions. 
- However, techniques like dimensionality reduction (PCA, t-SNE, UMAP) can project high-dimensional data and decision boundaries into 2D or 3D for interpretation.
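
A minimal sketch of the projection idea, assuming a hypothetical 10-feature dataset and PCA as the reduction technique: the classifier is trained in the full space, and only the instances (with their predicted labels) are projected to 2D for inspection.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Hypothetical 10-dimensional dataset
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=0, random_state=0)

# Fit the classifier in the full 10-dimensional space
clf = LogisticRegression(max_iter=1000).fit(X, y)
y_pred = clf.predict(X)

# Project the instances onto the first two principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("Projected shape:", X_2d.shape)
print("Variance explained by 2 components:", pca.explained_variance_ratio_.sum())
# X_2d[:, 0] and X_2d[:, 1] can now be scatter-plotted, colored by y_pred
```

The projection discards information, so classes that are cleanly separated in the original space may appear to overlap in the 2D view.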

**Impact of Dimension on Decision Boundaries**

Curse of Dimensionality:
- As dimensions increase, data points become sparse, making classification harder.
- Models like LDA or logistic regression may underperform without feature selection.

Model Complexity:
- Non-linear decision boundaries require more complex models (e.g., SVM with RBF kernels, neural networks).
- Overfitting is a significant risk in high dimensions.
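
One way to see the sparsity effect is to compare nearest and farthest distances as the dimension grows. The sketch below uses uniformly random points, which is an assumption made purely for illustration; real datasets usually have more structure.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(d, n=500):
    """Ratio of the nearest to the farthest distance from one query point."""
    points = rng.random((n, d))
    query = rng.random(d)
    dist = np.linalg.norm(points - query, axis=1)
    return dist.min() / dist.max()

ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, r in ratios.items():
    print(f"d={d:5d}  nearest/farthest = {r:.3f}")
```

As the ratio approaches 1, the nearest and farthest neighbours become almost equally distant, which makes distance-based separation less informative.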

### **6. Support Vector Machines (SVM)**
What It Means: 
- SVMs classify data by finding the best “boundary” (hyperplane) that separates classes with the widest possible margin.

Outcome Interpretation: 
- Data points on either side of the boundary belong to different classes, with "support vectors" helping to define the boundary.

Performance Measures:
- Accuracy: Proportion of correct classifications.
- Precision and Recall: Used when classes are imbalanced; precision is the correctness of positive predictions, and recall measures coverage.

Lay Explanation: 
- SVMs are like drawing a line to separate different groups, ensuring the groups are as distinct as possible with the help of a few key points.

Use Case: 
- Used for classification and regression in high-dimensional spaces, often for non-linearly separable data.

### **Support Vector Machines (SVM): Key Idea**

The idea behind SVM is to find the optimal hyperplane that best separates data points of different classes in the feature space.
- For data with $p$ features, the basic idea of the SVM is to separate points using a $(p - 1)$-dimensional hyperplane. 

What does it mean to separate points? 
- This means that the SVM will construct a decision boundary such that points on the left are assigned a label of $A$ and points on the right are assigned a label of $B$.  
- When finding this separating hyperplane, we wish to maximise the distance from the hyperplane to the nearest points. 
    - The technical term for this is the **maximum-margin hyperplane**.
- The data points which dictate where the separating hyperplane goes are called **support vectors**.

How it works in layman's terms:

Pretend that you want to classify data points into group $A$ or group $B$. An SVM will plot your labelled training data as points in space and will:
- look for the widest, clearest gap between points belonging to group A and points belonging to group B. 
- It will then use this newly identified dividing line (known as a hyperplane) and the margin around it to classify new observations. 
- An unseen data point will be classified into group A or B depending on which side of the hyperplane it falls on. 

##### Important Concepts in SVM
1. Hyperplane:
- A decision boundary that separates classes in the feature space.
    - In 2D, it’s a line:
        - When your data has only 2 features, a one-dimensional decision boundary (a line) is enough to classify the data.
    - In 3D, it’s a plane.
    - In higher dimensions (4 or more features), it’s a hyperplane:
        - As more features are added, the boundary takes on more dimensions.
        - In SVM, the hyperplane always has one dimension fewer than the number of input features ($p$), for a total of $(p-1)$ dimensions.

2. Margin:
- The distance between the hyperplane and the closest data points (called support vectors) of either class.
    - SVM maximizes this margin to create the most robust separation.

3. Support Vectors:
- The data points closest to the hyperplane, which influence its position and orientation.
4. Optimal Hyperplane:
- The hyperplane that maximizes the margin while correctly classifying the training data (or minimizing misclassifications).
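
These concepts can be inspected on a fitted model. The sketch below uses hypothetical, linearly separable toy data and a very large `C` to approximate a hard-margin SVM. For a linear SVM, the full margin width is $2 / \|w\|$.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM (no misclassification tolerated)
svm = SVC(kernel="linear", C=1e6).fit(X, y)

w = svm.coef_[0]          # normal vector of the hyperplane
b = svm.intercept_[0]     # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("Hyperplane: w =", w, ", b =", b)
print("Support vectors:\n", svm.support_vectors_)
print("Margin width:", margin_width)
```

Only the support vectors determine `w` and `b`; removing any of the other points and refitting would leave the hyperplane unchanged.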

Support Vector Machines in a nutshell:
- Like logistic regression, SVMs fit a linear decision boundary. 
- Unlike logistic regression, SVMs do this in a non-probabilistic way and can fit non-linear data using a technique known as the [kernel trick](https://en.wikipedia.org/wiki/Kernel_method).

SVMs can be used for both classification and regression. In `sklearn`, these are called:
- `SVC` (Support Vector Classifier)
- `SVR` (Support Vector Regression) 

SVC can also refer to Support Vector **Clustering**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

import warnings
warnings.filterwarnings("ignore")

##### Generate synthetic data
Generate a random dataset to experiment with:
- Take a multi-dimensional **standard normal distribution** and define classes separated by nested concentric multi-dimensional spheres, such that roughly equal numbers of samples are in each class (quantiles of the $\chi^2$ distribution).
    - With two features and two classes, this generates a donut-shaped dataset, where 
        - the samples belonging to one class are generally located in the centre
        - the samples belonging to the other class are generally located in the outer ring.


##### **Reasons for Normalizing Data in SVMs**
- Normalization matters because of how the SVM algorithm calculates margins and distances between data points.

SVM is Sensitive to Feature Scales
- SVM relies on calculating distances (e.g., Euclidean distance) between data points to determine margins and support vectors. 
    - If one feature has a much larger range than others, it will dominate the distance calculation, leading to biased results.
- Example: In a dataset with two features—age (ranging from 0 to 100) and income (ranging from 0 to 100,000)—income will heavily influence the decision boundary, even if age is equally or more important.

Ensures Proper Margins
- The SVM objective is to find the hyperplane that maximizes the margin between classes. 
    - Without normalization, the margin calculation may become skewed, resulting in suboptimal or incorrect decision boundaries.
- Example: If one feature has a larger scale, the margin might stretch disproportionately along that dimension, ignoring other features.

Improves Kernel Performance
- SVMs often use kernels (e.g., RBF, polynomial) to project data into higher dimensions. 
    - Kernels are sensitive to the relative scaling of features. Normalization ensures that all features contribute equally to the projection.
- Example: An RBF kernel requires well-scaled data to compute meaningful similarity measures between points. Poorly scaled data may lead to ineffective kernel computations.

Reduces Convergence Time
- SVM optimization involves iterative calculations that are influenced by feature scaling. 
    - Normalized data leads to faster and more stable convergence of the optimization algorithm.
- Example: When features are on drastically different scales, the optimization problem may take longer to converge or fail to converge entirely.

Handles Non-linear Decision Boundaries Better
- Why? For non-linear kernels (like RBF), the distance between points in feature space influences the shape of the decision boundary. 
    - Normalization ensures these distances are meaningful, leading to smoother and more accurate decision boundaries.

##### **Consequences of Not Normalizing**
- Poor Decision Boundaries: The SVM may create biased or incorrect hyperplanes, reducing model performance.
- Misclassification: The model may misclassify data, especially when features with large ranges dominate.
- Kernel Inefficiency: Kernels may fail to project the data effectively, leading to poor separation of classes.
- Increased Training Time: Optimization takes longer, impacting the efficiency of training.

##### **How to Normalize Data for SVMs**

1. **Standardization**: Subtract the mean and divide by the standard deviation for each feature
$$ z = \frac{x - \mu}{\sigma} $$
- This scales features to have a mean of 0 and a standard deviation of 1.

2. **Min-Max Scaling**: Rescale each feature to a fixed range, typically [0, 1] 
$$ z = \frac{x - \min(x)}{\max(x) - \min(x)} $$
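
Both scalings are available in scikit-learn as `StandardScaler` and `MinMaxScaler`. The sketch below applies them to hypothetical age and income data. Wrapping the scaler and the SVM in a pipeline is one common way to make sure the scaling statistics are learned from the training data only.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Hypothetical data: age (years) and income on very different scales
X = np.array([[25, 20_000], [32, 48_000], [47, 52_000],
              [51, 95_000], [62, 30_000], [38, 71_000]], dtype=float)
y = np.array([0, 0, 1, 1, 1, 0])

X_std = StandardScaler().fit_transform(X)  # mean 0, std 1 per feature
X_mm = MinMaxScaler().fit_transform(X)     # range [0, 1] per feature

print("Standardized means:", X_std.mean(axis=0))
print("Min-max ranges:", X_mm.min(axis=0), X_mm.max(axis=0))

# The pipeline fits the scaler and the SVM together
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, y)
```

Calling `model.predict` on new data reuses the mean and standard deviation learned during `fit`, so test samples are scaled consistently with the training set.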


In [None]:
from sklearn.datasets import make_gaussian_quantiles

# Set the feature dimensionality
p = 2

# Construct the dataset
X, y = make_gaussian_quantiles(cov=3.,
                                 n_samples=1000, n_features=p,
                                 n_classes=2, random_state=1)

In [None]:
# get training and testing data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

##### Fit a SVM classifier with a linear decision boundary
We are going to fit an SVC model with a `linear kernel`. This means that we are telling the SVC to fit the data using a linear decision boundary. Let's also take a look at the accuracy score:

In [None]:
from sklearn.svm import SVC

svc = SVC(kernel='linear')
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

print("The accuracy score of the SVC is:", accuracy_score(y_test, y_pred))
print("\n\nClassification Report:\n\n", classification_report(y_test, y_pred))

In [None]:
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

##### Plot the decision boundary for the SVC
If the accuracy score doesn't seem very good, visualisation can help us understand what's going on.

The SVC calculates and implements a $p-1$ dimensional decision boundary (hyperplane) over the input features.
- Since we are only looking at 2 features (our synthetic dataset only has two features, or $p=2$), our hyperplane will only have 1 dimension ($p-1$),
    - so it will look like a single line.
- If your model has more than 2 features, you can plot the hyperplane for any 2 features you choose.

##### **Calculating the Dimensions of a separating Hyperplane**
The dimensions of a separating hyperplane depend on the number of features (or predictors) in the dataset.

Definition of a Hyperplane

A hyperplane in n-dimensional space is defined as:
$$ w \cdot x + b = 0 $$

where:
- $w = [w_1, w_2, ..., w_n]$: Weight vector normal to the hyperplane.
- $x = [x_1, x_2, ..., x_n]$: Feature vector of an instance.
- $b$: Bias term (offset from the origin).

The hyperplane separates data into two classes:
- $w \cdot x + b > 0$: Class 1
- $w \cdot x + b < 0$: Class 2

Dimensions of a Hyperplane

The dimensionality of the hyperplane is determined by the number of features in the dataset:
- If the dataset has n features, the hyperplane is an (n−1)-dimensional subspace.

Examples:
- 2 Features (2D): The hyperplane is a 1D line.
- 3 Features (3D): The hyperplane is a 2D plane.
- 4 Features (4D): The hyperplane is a 3D subspace (hard to visualize, but mathematically valid).

Intuition Behind Dimensions

- The hyperplane must divide the feature space into two regions corresponding to different classes.
- With more features, the hyperplane has more free parameters (one weight per feature), which lets the SVM capture patterns that involve more variables.
- Kernels: When data is mapped to a higher-dimensional feature space using kernels (e.g., RBF), the hyperplane exists in the higher-dimensional space, though its exact dimensions depend on the kernel's transformation

When Are the Dimensions Relevant?

- At Training Time: The dimensions of the hyperplane are implicitly calculated when the SVM solves the optimization problem to find 𝑤 and b.
    - The optimization ensures the hyperplane maximizes the margin between support vectors of the two classes.
- During Prediction: The dimensionality of the hyperplane affects how data points are classified. The model computes:
    - Decision Function: $w \cdot x + b$
        - The sign of this value determines the predicted class.
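
For a linear SVM, the decision function can be reproduced by hand from the fitted weights. The sketch below assumes a hypothetical dataset with n = 3 features, so `w` has three entries and the separating hyperplane is a 2D plane.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical dataset with n = 3 features
X, y = make_classification(n_samples=100, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

svm = SVC(kernel="linear").fit(X, y)

w = svm.coef_[0]       # one weight per feature
b = svm.intercept_[0]  # bias term

# Decision function computed manually as w . x + b
manual_scores = X @ w + b
manual_pred = (manual_scores > 0).astype(int)

print("Number of weights:", w.shape[0])
print("Matches decision_function:",
      np.allclose(manual_scores, svm.decision_function(X), atol=1e-6))
```

Note that `coef_` is only available for the linear kernel; with non-linear kernels the weight vector lives in the transformed space and is not exposed directly.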

##### **Calculation of the Dimensions**
- The dimensions are calculated implicitly when the SVM solves its optimization problem to find 𝑤 and b.
- The dimensionality of the hyperplane is directly tied to the feature space in which the data resides.

Steps:
1. Input Data Dimension: Count the number of features n in your dataset.
- Example: If your dataset has features $x = [x_1, x_2, x_3]$ , it’s a 3-dimensional space.

2. Hyperplane Dimension: The hyperplane will have (n−1) dimensions.
- For the 3-feature example, the hyperplane is a 2D plane.

In this case, the donut-shaped data is not `linearly separable`, so no straight line can split the two classes well.

In [None]:
i = 0 # Feature 1
j = 1 # Feature 2

svc.fit(X[:, [i, j]], y)
fig = plt.figure(figsize=(8, 8))
ax1 = fig.add_subplot(111)
 
x_min, x_max = X[:, i].min(), X[:, i].max()
y_min, y_max = X[:, j].min(), X[:, j].max()
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 1000), np.linspace(y_min, y_max, 1000))

y_hat = svc.predict(np.concatenate((xx.reshape(-1,1), yy.reshape(-1,1)), axis=1))
y_hat = y_hat.reshape(xx.shape)

# Pass colormap names as strings; plt.cm.get_cmap was removed in newer Matplotlib
ax1.pcolormesh(xx, yy, y_hat, cmap='RdBu_r')
ax1.scatter(X[:, i], X[:, j], c=y, edgecolors='k', cmap='RdBu_r')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.set_xlim(xx.min(), xx.max())
ax1.set_ylim(yy.min(), yy.max())
ax1.set_xticks(())
ax1.set_yticks(())
plt.show()

Solution: Use the SVM's [kernel trick](https://en.wikipedia.org/wiki/Kernel_method) to fit a **non-linear** decision boundary instead.

### Approaches to Find Non-linear Boundaries Using a Linear Model
Finding a non-linear boundary with a linear model involves transforming the input features into a higher-dimensional space where the non-linear relationships between the features can be captured. 
- This approach is fundamental to models like Support Vector Machines (SVMs) and forms the basis of the kernel trick.

1. **Feature Transformation**
- Idea: Transform the original features into new features such that the transformed space allows a linear boundary to separate the classes.
- Example: Suppose you have two features, $x_1, x_2$, which cannot be separated linearly. You can transform these features into a higher-dimensional space:

$$ \phi (x_1, x_2) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2) $$

- In this new feature space, the classes might become linearly separable.

2. **Adding Polynomial Features**
- Method: Include interaction terms or higher-order terms of the features.
- Example 1: for quadratic relationships
$$ \phi (x_1, x_2) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2) $$

- Example 2: for cubic relationships

$$ \phi (x_1, x_2) = (x_1, x_2, x_1^2, x_2^2, x_1 x_2, x_1^3, x_2^3, x_1^2 x_2, x_1 x_2^2) $$

- Outcome: The linear model operates on these new features, effectively creating non-linear boundaries in the original space.
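
The effect of adding polynomial features can be sketched with scikit-learn's `PolynomialFeatures`. The concentric-circle data from `make_circles` is a hypothetical stand-in for any dataset that a straight line cannot separate.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical concentric-circle data that no straight line can separate
X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# Linear model on the original two features
linear_model = LogisticRegression().fit(X, y)

# Same linear model, but on features extended with x1^2, x1*x2 and x2^2
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(),
).fit(X, y)

linear_acc = linear_model.score(X, y)
poly_acc = poly_model.score(X, y)
print("Accuracy with original features:  ", linear_acc)
print("Accuracy with quadratic features: ", poly_acc)
```

The classifier is still linear in the transformed features; the squared terms let it express a circular boundary in the original space.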

3. **Kernel Trick**
- Idea: Instead of explicitly transforming features into a higher-dimensional space, compute the dot product of the transformed features directly using a kernel function.
- Common Kernels:
    - Polynomial Kernel:
        - Allows non-linear boundaries based on polynomial relationships of degree $d$.
$$ K(x, x') = (\langle x, x' \rangle + c)^d $$

    - Radial Basis Function (RBF) Kernel:
        - Creates highly flexible decision boundaries that adapt to the data.
$$ K(x, x') = \exp \left( -\frac{\|x - x'\|^2}{2\sigma^2} \right) $$

- Why it works: The kernel trick avoids the computational cost of explicitly mapping features into high dimensions by working directly with the similarity (dot product) in the transformed space.
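
The equivalence behind the kernel trick can be verified numerically for the degree-2 polynomial kernel with $c = 1$. The explicit feature map `phi` below is the standard one for two features; the input vectors are hypothetical.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map matching the kernel (<x, z> + 1)^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

def poly_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel computed directly in the original 2D space."""
    return (x @ z + c) ** d

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)     # dot product in the 6-dimensional space
implicit = poly_kernel(x, z)   # same value without building phi

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

The kernel needs one 2D dot product, while the explicit route builds two 6D vectors first. The gap widens quickly as the degree and the number of features grow.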

4. **Piecewise Linear Boundaries**
- Idea: Create a decision boundary that is linear within small regions of the input space but collectively forms a non-linear boundary.
- Implementation:
    - Use decision trees or ensemble methods like random forests or gradient boosting to partition the input space into small regions.
    - Apply a linear model within each region.
- Outcome: The combined boundary is non-linear in the original feature space.

5. **Using Basis Functions**
- Idea: Use basis functions to project the features into a new space.
- Example: Gaussian basis functions

$$ \phi (x) = \exp \left( -\frac{\|x - \mu\|^2}{2\sigma^2} \right) $$

    - Each basis function creates a feature that measures the proximity of x to a center $\mu$, allowing flexible, non-linear boundaries.
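
A minimal sketch of Gaussian basis features feeding a linear classifier. The single center at the origin and the width `sigma=0.5` are hypothetical choices that suit the concentric-circle data used here; other datasets would need different centers.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression

def gaussian_basis(X, centers, sigma=0.5):
    """One feature per center: exp(-||x - mu||^2 / (2 sigma^2))."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma**2))

# Hypothetical concentric-circle data
X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)

# Hypothetical choice: a single center at the origin
centers = np.array([[0.0, 0.0]])
Phi = gaussian_basis(X, centers)

# A linear classifier on the single proximity feature
clf = LogisticRegression().fit(Phi, y)
acc = clf.score(Phi, y)
print("Feature matrix shape:", Phi.shape)
print("Accuracy on basis-function features:", acc)
```

With one center, the new feature depends only on the distance from the origin, so a single threshold on it corresponds to a circular boundary in the original space.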

6. **Regularization with Non-linear Effects**
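- Idea: Generate higher-order features or interaction terms, then use regularization to keep the model in check. Lasso and Elastic Net can automatically select the terms that help model non-linear boundaries by shrinking the other coefficients to exactly zero, while Ridge shrinks coefficients without removing any.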
- Benefit: Controls overfitting while allowing non-linear effects.

##### Advantages of Finding Non-linear Boundaries with Linear Models
- Avoids the need for explicitly training complex non-linear models.
- Exploits the mathematical simplicity of linear models for efficient computation.
- Kernel methods allow flexibility without explicit feature engineering.

##### Challenges
- Overfitting: High-dimensional transformations can lead to overfitting if not regularized.
- Computational Cost: Complex kernels (e.g., RBF) may increase computational time for large datasets.
- Interpretability: Non-linear boundaries are harder to interpret than linear ones.

By combining feature transformations, kernel tricks, and regularization, linear models can effectively handle non-linear relationships and create decision boundaries that adapt to complex data structures.

### Fit a SVC classifier with a non-linear decision boundary
The Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, is a powerful method for finding non-linear decision boundaries by mapping input features into a higher-dimensional space.

Using SVM with Radial Basis Function (RBF) Kernel (Gaussian kernel)
- Original Space: Two classes that overlap non-linearly cannot be separated by a straight line.
- Transformed Space: The RBF kernel projects the data into a higher-dimensional space where the overlap is reduced, and a hyperplane (linear boundary) can separate the classes.
- Decision Boundary in Original Space: The hyperplane in the transformed space translates into a non-linear boundary in the original space.

##### **How the RBF Kernel Works**

Radial Basis Function Definition: 
- The RBF kernel measures the similarity between two data points $x$ and $x'$.
- The Gaussian RBF kernel is defined as:
$$ K(x, x') = \exp \left(-\frac{||x - x'||^2}{2\sigma^2}\right) $$
- where:
    - $||x - x'||^2$: Squared Euclidean distance between the two points.
    - $\sigma$: Kernel width parameter (controls the "spread" of the Gaussian function).
    - $K(x, x')$: Similarity measure (ranges from 0 to 1). A value close to 1 indicates high similarity.

Key Idea: 
- The RBF kernel implicitly maps the input data into an infinite-dimensional feature space where a linear decision boundary can separate the data. This allows the SVM to create highly flexible non-linear boundaries in the original space.

Role of Hyperparameters:
- C: Regularization parameter (controls the trade-off between maximizing the margin and minimizing classification error).
- $\gamma = \frac{1}{2\sigma^2}$: Kernel coefficient (controls how far the influence of a single data point reaches).
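
The relation between $\sigma$ and $\gamma$ can be checked numerically. In the sketch below, the two points and $\sigma = 2$ are arbitrary illustrative values; `rbf_kernel` is scikit-learn's pairwise kernel function, which takes `gamma` rather than `sigma`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

sigma = 2.0
gamma_from_sigma = 1.0 / (2.0 * sigma**2)  # gamma = 1 / (2 sigma^2) = 0.125

p = np.array([[0.0, 0.0]])
q = np.array([[1.0, 1.0]])

sq_dist = np.sum((p - q) ** 2)                            # (0-1)^2 + (0-1)^2 = 2
k_sigma = np.exp(-sq_dist / (2.0 * sigma**2))             # sigma parameterization
k_gamma = rbf_kernel(p, q, gamma=gamma_from_sigma)[0, 0]  # gamma parameterization

print("gamma:", gamma_from_sigma)
print("K via sigma:", k_sigma)
print("K via gamma:", k_gamma)
```

Both parameterizations give $e^{-0.25} \approx 0.779$. A larger $\sigma$ corresponds to a smaller $\gamma$, that is, a wider kernel.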

**Effect of gamma on RBF**:
- Gamma has a significant impact on the model's performance.
- It determines how far the influence of a single data point extends, controlling the shape and flexibility of the decision boundary.

Small Gamma (e.g., 0.01):
- Wide Influence: Each support vector influences a large region of the feature space.
- Simpler Decision Boundary: The model creates smoother, less complex decision boundaries.
- Risk of Underfitting: If the gamma is too small, the model may fail to capture the complexity of the data, leading to poor performance.
- Example:
    - The decision boundary might be too broad to separate closely spaced data points.
    - Good for datasets with large-scale structures and fewer fine details.

Large Gamma (e.g., 100):
- Narrow Influence: Each support vector influences only a small region around itself.
- Complex Decision Boundary: The model creates highly detailed and intricate decision boundaries.
- Risk of Overfitting: If gamma is too large, the model may memorize the training data and generalize poorly to unseen data.
- Example:
    - The decision boundary might wrap tightly around individual data points, leading to poor generalization.

Visualizing the Effect of Gamma

Here is a comparison of decision boundaries with different gamma values:
- Low Gamma: Broad decision regions; the model captures global patterns but misses local variations.
- High Gamma: Sharp, detailed decision regions; the model captures local variations but may overfit noise in the data.

Optimal Gamma: Bias-Variance Tradeoff
- Small Gamma: High bias, low variance.
- Large Gamma: Low bias, high variance.
- The optimal gamma strikes a balance, allowing the model to generalize well without overfitting or underfitting.
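
One way to see this trade-off in numbers, rather than only in plots, is to compare training and test accuracy for a few gamma values. The sketch below assumes the same moons dataset used elsewhere in these notes; the variable names and the specific gamma values are illustrative choices.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_g, y_g = make_moons(n_samples=300, noise=0.2, random_state=42)
X_g_train, X_g_test, y_g_train, y_g_test = train_test_split(
    X_g, y_g, test_size=0.3, random_state=42
)

gamma_scores = {}
for g in [0.01, 1, 100]:
    model = SVC(kernel="rbf", gamma=g, C=1.0).fit(X_g_train, y_g_train)
    gamma_scores[g] = (
        model.score(X_g_train, y_g_train),  # training accuracy
        model.score(X_g_test, y_g_test),    # test accuracy
        int(model.n_support_.sum()),        # total number of support vectors
    )

for g, (train_acc, test_acc, n_sv) in gamma_scores.items():
    print(f"gamma={g:<6} train={train_acc:.3f} test={test_acc:.3f} support vectors={n_sv}")
```

A widening gap between training and test accuracy as gamma grows is the usual sign of overfitting. Low accuracy on both sets at very small gamma points to underfitting.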

Optimization: 
- The SVM finds the hyperplane in the transformed space that maximizes the margin while minimizing misclassifications (via hinge loss).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_moons

# Generate synthetic data
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)

# Visualize data
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.title("Dataset: Moons")
plt.show()

# Define gamma values to test
gamma_values = [0.01, 0.1, 1, 10]

# Plot decision boundaries for different gamma values
fig, axes = plt.subplots(1, len(gamma_values), figsize=(20, 5))

for i, gamma in enumerate(gamma_values):
    svm = SVC(kernel='rbf', gamma=gamma, C=1.0)
    svm.fit(X, y)
    
    # Decision boundary
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = svm.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    axes[i].contourf(xx, yy, Z, levels=20, cmap='viridis', alpha=0.7)
    axes[i].scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolor='k')
    axes[i].set_title(f"Gamma: {gamma}")

plt.tight_layout()
plt.show()


##### Use the rbf kernel (Radial_basis_function_kernel), which allows the SVC to fit a non-linear decision boundary from scratch.

Steps to Use SVM with RBF Kernel
1. Choose Hyperparameters:
- Start with default values of
    - C
    - $\gamma$
- Use techniques like grid search or random search to tune hyperparameters.
2. Transform the Data:
- The RBF kernel implicitly transforms the data. No explicit feature mapping is required.
3. Fit the Model:
- Optimize the SVM objective function to find the best hyperplane in the transformed space.
4. Evaluate Performance:
- Use metrics like accuracy, precision, recall, and F1-score.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

# Step 1: Generate a non-linear dataset
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 2: Visualize the dataset
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolor='k')
plt.title("Non-linear Dataset (Moons)")
plt.show()

# Step 3: Train an SVM with RBF Kernel
# Default hyperparameters
svm_rbf = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_rbf.fit(X_train, y_train)

# Step 4: Evaluate the model
y_pred = svm_rbf.predict(X_test)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))

# Step 5: Visualize the decision boundary
def plot_decision_boundary(model, X, y):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, levels=20, cmap='viridis', alpha=0.7)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', edgecolor='k')
    plt.title("SVM with RBF Kernel - Decision Boundary")
    plt.colorbar()
    plt.show()

plot_decision_boundary(svm_rbf, X, y)

# Step 6: Hyperparameter tuning using GridSearchCV
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.01, 0.1, 1]
}

grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)
print("Best Cross-Validation Score:", grid_search.best_score_)

# Step 7: Evaluate the best model
best_model = grid_search.best_estimator_
y_pred_best = best_model.predict(X_test)

print("Classification Report (Best Model):\n", classification_report(y_test, y_pred_best))
print("Accuracy Score (Best Model):", accuracy_score(y_test, y_pred_best))

plot_decision_boundary(best_model, X, y)

##### Use Gaussian RBF for an Observation
For a single observation $x_i$ and a reference point $x_j$, the Gaussian RBF kernel is calculated as:

$$ K(x_i, x_j) = \exp (-\gamma ||x_i - x_j||^2) $$
- where:
    - $||x_i - x_j||^2$ is the squared Euclidean distance.
    - $\gamma$ controls the spread of the kernel 
        - small $\gamma$: considers distant points similar, 
        - large $\gamma$: considers only close points similar.

In [None]:
from sklearn.metrics.pairwise import rbf_kernel

# Compute the RBF kernel for two points
point1 = np.array([1.0, 2.0]).reshape(1, -1)
point2 = np.array([3.0, 4.0]).reshape(1, -1)
gamma = 0.5

rbf_value = rbf_kernel(point1, point2, gamma=gamma)
print("RBF Kernel Value between point1 and point2:", rbf_value[0][0])


##### Use the rbf kernel (Radial_basis_function_kernel), which allows the SVC to fit a non-linear decision boundary with the built-in library.

In [None]:
svc = SVC(kernel='rbf')
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

print("The accuracy score of the SVC is:", accuracy_score(y_test, y_pred))
print("\n\nClassification Report:\n\n",classification_report(y_test, y_pred))

##### Plot the decision boundary for the SVC using the non-linear rbf kernel

Plot the 1 dimensional decision boundary between the 2 features present in our synthetic dataset:

In [None]:
i = 0 # Feature 1
j = 1 # Feature 2

svc.fit(X[:, [i, j]], y)
fig = plt.figure(figsize=(8, 8))
ax1 = fig.add_subplot(111)
 
x_min, x_max = X[:, i].min(), X[:, i].max()
y_min, y_max = X[:, j].min(), X[:, j].max()
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 1000), np.linspace(y_min, y_max, 1000))

y_hat = svc.predict(np.concatenate((xx.reshape(-1,1), yy.reshape(-1,1)), axis=1))
y_hat = y_hat.reshape(xx.shape)

# plt.cm.get_cmap was removed in Matplotlib 3.9; passing the colormap name works across versions
ax1.pcolormesh(xx, yy, y_hat, cmap='RdBu_r')
ax1.scatter(X[:, i], X[:, j], c=y, edgecolors='k', cmap='RdBu_r')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.set_xlim(xx.min(), xx.max())
ax1.set_ylim(yy.min(), yy.max())
ax1.set_xticks(())
ax1.set_yticks(())
plt.show()

### **Objective Function of SVM**
The objective of SVM is twofold:
- Maximize the margin (maximize separation between classes).
- Minimize classification errors for non-linearly separable data (using a hinge-loss function).

It is designed to maximize the margin between the two classes while minimizing classification errors. 
- For m features, the objective function considers the weight vector $w \in R^m$, which defines the orientation of the separating hyperplane in the feature space.

#####  Primal Form of the Objective Function: Mathematical Formulation
Given:
- A dataset with n training samples $(x_i, y_i)$ where 
    - $x_i \in R^m$ are feature vectors
    - $y_i \in \{-1, 1\}$ are class labels.
- A weight vector $w$ and bias $b$ defining the hyperplane.

The decision boundary is represented by:
$$ f(x) = w^T x + b $$

The **SVM primal objective function** is:

$$ Minimize_{w,b,\xi}: \frac{1}{2} ||w||^2 + C \sum^n_{i=1} \xi_i$$

- Where:
    - $\frac{1}{2} ||w||^2$: Ensures the margin is maximized, promoting simplicity.
        - Encourages a large margin between the two classes (simpler decision boundary).
        - It minimizes the norm of the weight vector w, maximizing the margin between the classes.
    - $C$: Regularization parameter that controls the trade-off between margin maximization and classification errors.
        - The choice of C determines where this balance lies.
    - $\sum^n_{i=1} \xi_i$: A penalty for all margin violations. 
        - A higher sum of $\xi_i$ implies more violations.
    - $\xi_i$: Slack variable for the i-th data point.
        - It measures the extent to which the i-th data point violates the margin (by lying within the margin or by being misclassified).


The minimization is subject to the constraints:

$$ y_i  \cdot (w^T x_i + b) \geq 1  - \xi_i , \quad \xi_i \geq 0 \quad \forall i $$

##### Hinge-Loss Function

The hinge-loss function is used to penalize misclassifications and points close to the margin. It is defined as:

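$$ \text{Hinge Loss: } L(y,f(x)) = \max(0,1 - y  \cdot f(x)) $$

- If $y  \cdot f(x) \geq 1$, the loss is 0 (correctly classified and on or beyond the margin).
- If $y  \cdot f(x) < 1$, the loss grows linearly as the point moves further inside the margin or across to the wrong side of the decision boundary.

At the optimum, each slack variable equals the hinge loss of its data point, $\xi_i = \max(0, 1 - y_i \cdot (w^T x_i + b))$, so the **SVM primal objective function** can be written without constraints as:

$$ Minimize_{w,b}: \frac{1}{2} ||w||^2 + C \sum^n_{i=1} \max(0,1 - y_i  \cdot (w^T x_i + b))$$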
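
To make the formula concrete, the following sketch evaluates the hinge loss and the unconstrained objective for a hand-picked weight vector. The toy data, `w_toy`, `b_toy`, and the helper names are assumptions chosen so that the arithmetic is easy to follow; nothing is optimized here.

```python
import numpy as np


def hinge_loss(y, scores):
    """Element-wise hinge loss max(0, 1 - y * f(x)) for labels in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * scores)


def primal_objective(w, b, X, y, C=1.0):
    """Evaluate 0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    scores = X @ w + b
    return 0.5 * np.dot(w, w) + C * np.sum(hinge_loss(y, scores))


# Three positive-class points with scores f(x) = 2, 0.5, and -1
X_toy = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y_toy = np.array([1.0, 1.0, 1.0])
w_toy = np.array([1.0, 0.0])
b_toy = 0.0

losses = hinge_loss(y_toy, X_toy @ w_toy + b_toy)
objective = primal_objective(w_toy, b_toy, X_toy, y_toy, C=1.0)

print("Hinge losses:", losses)
print("Objective value:", objective)
```

The three points illustrate the three regimes: beyond the margin (loss 0), inside the margin (loss 0.5), and misclassified (loss 2). The objective is $0.5 \cdot 1 + 1 \cdot 2.5 = 3$.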

### **Understanding the Objective Function**

1. Margin Maximization ($\frac{1}{2} ||w||^2$): 
- The first term ensures that the hyperplane has the largest margin by minimizing the norm of the weight vector ($||w||$).
    - A smaller ||w|| corresponds to a larger margin.

2. Hinge Loss ($C \sum^n_{i=1} \max(0,1 - y_i  \cdot f(x_i))$)
- The second term penalizes points that are misclassified or fall within the margin: each point contributes $\xi_i = \max(0,1 - y_i  \cdot (w^T x_i + b))$ to the sum $\sum^n_{i=1} \xi_i$.
- The parameter C>0 is a `regularization parameter` that controls the trade-off between maximizing the margin and minimizing classification errors.
    - How C Works:
        - Large C: Strongly penalizes misclassifications, leading to a tighter fit to the training data.
        - Small C: Allows for more margin violations, leading to a simpler, more generalizable model.

Regularization parameter trade-off:
- Regularization adjusts the balance between two objectives:
    - Maximizing the margin: Keeping $||w||^2$ small promotes a large margin and simpler models.
    - Minimizing misclassification error: Penalizing $\sum^n_{i=1} \xi_i$ ensures the model correctly classifies most training instances.


##### Interpretation of the Objective Function
The function combines two objectives:
1. Maximizing the margin: Achieved by minimizing $\frac{1}{2} ||w||^2$ resulting in a decision boundary that is as far as possible from the nearest data points (support vectors).
    - A larger margin improves the model's generalization ability (i.e., it performs better on unseen data).
2. Minimizing classification errors: Achieved by penalizing the slack variables $\xi_i$ via the term $C \sum^n_{i=1} \xi_i$, which accounts for points that lie within the margin or on the wrong side of the decision boundary.
    - Points misclassified or within the margin are penalized, encouraging the model to position the hyperplane optimally.

#### How SVM Uses Hinge Loss

##### **2.1 How violations and misclassification are measured in soft margin classification.**
Soft Margin Classification:
- The soft margin SVM introduces slack variables ($\xi_i$) that allow some points to violate the margin constraints, making it suitable for noisy data and for data that is not linearly separable.
- The hinge loss incorporates these violations, enabling SVM to work with noisy or overlapping data.

In soft margin classification, violations and misclassification are measured using slack variables ($\xi_i$), which represent the extent to which a data point deviates from the ideal separation defined by the decision boundary and margin.

The Role of the Slack Variables ($\xi_i$)
- Slack variables are introduced in the soft margin SVM to allow for some data points to:
    - Lie inside the margin (violations).
    - Be misclassified (on the wrong side of the decision boundary).
- Each data point i has an associated slack variable ($\xi_i \geq 0$), which quantifies its violation of the margin constraints.

Decision Boundary and Constraints
- The margin constraints in soft margin classification are:
$$ y_i (w \cdot x_i + b) \geq 1  - \xi_i , \quad \xi_i \geq 0 $$

- When $ y_i (w \cdot x_i + b) \geq 1$:
    - The data point is correctly classified and outside the margin. No violation occurs, so $\xi_i = 0$.
- When $ 0 < y_i (w  \cdot x_i + b) < 1$: 
    - The data point is correctly classified but lies inside the margin. The margin is violated, and $\xi_i > 0$.
- When $ y_i (w \cdot x_i + b) < 0$:
    - The data point is misclassified and on the wrong side of the decision boundary. This is a severe violation, with a larger $\xi_i$.

Measuring Margin Violations
- The slack variable $\xi_i$ measures the distance a point falls short of the margin boundary. Specifically:
    - $\xi_i = 0$, the point lies on or outside the correct margin.
    - $0 < \xi_i \leq 1$, the point is inside the margin but correctly classified.
    - $\xi_i > 1$, the point is misclassified.

- Total Margin Violation
    - The total violation across all data points is:
$$ \sum^N_{i=1} \xi_i$$

Misclassification
- Misclassification occurs when a data point lies on the wrong side of the decision boundary:
$$y_i (w \cdot x_i + b) < 0$$

- For misclassified points, $\xi_i >1$
    - The quantity $\xi_i - 1$ represents the extent of misclassification.

- Misclassification Count
    - The number of misclassified points can be roughly estimated as:
        - number of misclassifications $\approx \sum^N_{i=1} 1 (\xi_i >1)$ where:
            - $1(\cdot)$ is an indicator function that equals 1 if the condition is true, and 0 otherwise.

- Example: assume we have the following
    - $y_i = +1$: Positive class.
        - The margin for $y_i = +1$ is defined as $(w \cdot x_i + b) \geq +1$

- Possible Cases:
    - Correct Classification Outside the Margin: $y_i(w \cdot x_i + b) \geq +1$
        - No violation: $\xi_i = 0$
    - Correct Classification Inside the Margin: $0 < y_i(w \cdot x_i + b) < 1$
        - Margin violation occurs: $0 < \xi_i < 1$
    - Misclassified Point: $y_i(w \cdot x_i + b) < 0$
        - Severe violation: $\xi_i > 1$
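
The three cases can be counted on a fitted model. The sketch below is one possible way to recover the slack values from scikit-learn's `SVC`: it assumes a linear kernel, maps the 0/1 labels to -1/+1, and uses $\xi_i = \max(0, 1 - y_i f(x_i))$. The dataset and variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X_s, y_s = make_moons(n_samples=200, noise=0.25, random_state=0)
y_signed = 2 * y_s - 1  # map labels {0, 1} to {-1, +1}

lin_svm = SVC(kernel="linear", C=1.0).fit(X_s, y_s)
f_values = lin_svm.decision_function(X_s)           # w . x_i + b for each point
slack = np.maximum(0.0, 1.0 - y_signed * f_values)  # xi_i = max(0, 1 - y_i f(x_i))

n_outside = int(np.sum(slack == 0))                 # on or beyond the margin
n_inside = int(np.sum((slack > 0) & (slack <= 1)))  # inside the margin, correct side
n_misclassified = int(np.sum(slack > 1))            # wrong side of the boundary

print("No violation (xi = 0):        ", n_outside)
print("Inside margin (0 < xi <= 1):  ", n_inside)
print("Misclassified (xi > 1):       ", n_misclassified)
print("Total slack (sum of xi):      ", round(float(slack.sum()), 3))
```

Because the solver stops at a numerical tolerance, support vectors that sit exactly on the margin may show a tiny non-zero slack. The count of points with $\xi_i > 1$ matches the number of training points the model misclassifies.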

##### **2.2 Reasons for using `Regularization` in SVM**
Regularization in Support Vector Machines (SVMs) is crucial to ensure that the model generalizes well to unseen data. 
- Regularization introduces a penalty for overly complex models, preventing overfitting.
- Regularization in SVM controls the trade-off between:
    - Maximizing the margin: Ensuring the decision boundary is as far as possible from the nearest data points.
    - Minimizing misclassification errors: Allowing some points to fall inside the margin or on the wrong side of the decision boundary for better generalization.

Control Overfitting
- Reason: SVM aims to maximize the margin between classes while minimizing misclassification errors. Without regularization, the model might try to perfectly classify the training data, resulting in overfitting.
- Solution: Regularization balances the trade-off between achieving a larger margin (simpler model) and minimizing classification errors.
    - A larger regularization parameter (C): penalizes misclassifications more heavily, potentially leading to overfitting.
    - A smaller regularization parameter (C):  favors a larger margin and allows for more misclassifications, promoting generalization.

Handle Noisy Data
- Reason: Real-world datasets often contain noise, outliers, or mislabeled data points. Without regularization, SVM may overemphasize these noisy points, leading to a distorted decision boundary.
- Solution: Regularization reduces the influence of such noisy points by allowing some tolerance for misclassification, leading to a more robust model.

Promote Simpler Decision Boundaries
- Reason: Complex decision boundaries can lead to poor generalization on new data.
- Solution: Regularization encourages the SVM to find a simpler decision boundary by controlling the weight vector (w) through a regularization term in the objective function.

Avoid Curse of Dimensionality
- Reason: In high-dimensional spaces, the risk of overfitting increases because the model has more capacity to fit the training data perfectly.
- Solution: Regularization reduces the model's flexibility, preventing overfitting in high-dimensional feature spaces.

Improve Generalization Performance
- Reason: A model that fits the training data too closely may fail to generalize to unseen data.
- Solution: Regularization ensures that the SVM focuses on the most informative patterns in the data, improving performance on test data.

Kernel Methods and Regularization
- Reason: When using kernel functions (e.g., RBF, polynomial), the feature space is transformed into a higher dimension, increasing the model's capacity to overfit.
- Solution: Regularization mitigates overfitting by constraining the optimization process, ensuring the model finds a balance between complexity and accuracy.

##### Type of regularization used in Soft Margin Classification for Support Vector Machines (SVMs)
Type used is L2 regularization.

L2 Regularization in the Objective Function: 
$$Minimize_{w,b,\xi}: \frac{1}{2} ||w||^2 + C \sum^n_{i=1} \xi_i$$

- Where:
    - $\frac{1}{2} ||w||^2$: represents L2 regularization, as it minimizes the squared Euclidean norm of the weight vector w. 
        - It helps in maximizing the margin by penalizing larger weight values, which results in a smoother and more generalizable decision boundary.
    - $C \sum^n_{i=1} \xi_i$: This term penalizes margin violations (misclassification or points lying within the margin). 
        - The parameter C determines the penalty strength.

Why L2 Regularization?
- L2 regularization is chosen because:
    - It encourages smaller weight magnitudes ($w_i^2$), which leads to a more stable model that is less sensitive to noise in the data.
    - It avoids overfitting by penalizing complex decision boundaries.
    - The quadratic term $||w||^2$ ensures that the solution is smooth and generalizes well to unseen data.

Mathematical Interpretation
- The L2 regularization term $\frac{1}{2} ||w||^2$ ensures that the weight vector w remains small, which effectively controls the model's complexity. 
    - Smaller weights correspond to a more stable and less overfitted model.

Connection to Dual Formulation
- In the dual formulation, the regularization parameter C places an upper bound on the Lagrange multipliers $\alpha_i$:

$$ 0 \leq \alpha_i \leq C $$

This constraint ensures that the influence of each data point on the decision boundary is limited, balancing the trade-off between margin maximization and classification accuracy.

##### **2.3 Hyperparameter C in soft margin classification**
The objective function for soft margin SVM is:

$$Minimize_{w,b,\xi}: \frac{1}{2} ||w||^2 + C \sum^n_{i=1} \xi_i$$

Hyperparameter C
- Determines the trade-off between maximizing the margin (pathway width) and minimizing classification errors.
- Determines how much weight is given to minimizing slack variables relative to maximizing the margin width.

Trade-Off Parameter (C):
- Governs the trade-off between the two components of the objective function:
    - Large C: Focuses on minimizing misclassification, potentially at the cost of a smaller margin (risk of overfitting).
        - Strongly penalizes misclassification.
        - Results in a smaller margin as the model tries to classify every point correctly.
        - May lead to overfitting, especially on noisy data.
    - Small C: Focuses on maximizing the margin, tolerating some misclassifications (risk of underfitting).
        - Allows more margin violations (misclassified points).
        - Results in a larger margin and simpler decision boundary.
        - Promotes better generalization, reducing the risk of overfitting.

High C: Narrower Pathway Width
- The penalty for margin violations ($C \sum \xi_i$) becomes significant.
- The SVM prioritizes classifying training points correctly over maximizing the margin width.
- The model becomes more sensitive to individual data points, which can lead to:
    - A narrower pathway width (smaller margin).
    - Overfitting, where the decision boundary conforms too closely to the training data.
- Behavior:
    - The margin shrinks to fit the data points tightly.
    - Misclassified points are heavily penalized, so the model tries to minimize their number at the cost of a smaller margin.

Low C: Wider Pathway Width
- The penalty for margin violations becomes less significant.
- The SVM focuses on maximizing the margin width, even if it means allowing some misclassified points.
- The model becomes less sensitive to noise and outliers, leading to:
    - A wider pathway width (larger margin).
    - Better generalization to unseen data.
- Behavior:
    - The decision boundary prioritizes a larger margin over perfect classification.
    - Misclassified points are tolerated, reducing the risk of overfitting.

##### Relationship Between C and Pathway Width
The pathway width (or margin width) is inversely related to C:
- High C: Narrower pathway (small margin).
- Low C: Wider pathway (large margin).

This trade-off reflects the bias-variance trade-off:
- High C: Low bias, high variance (more complex model).
- Low C: High bias, low variance (simpler model).

Practical Analysis of C and Pathway Width
- To analyze the relationship between C and margin width:
    - Train the SVM Model: Train models with different values of C.
    - Visualize the Decision Boundary:
        - Plot the decision boundary and margins for low, medium, and high C values.
        - Observe how the margin width and boundary placement change.
    - Evaluate Performance:
        - On training data, high C often results in lower misclassification rates.
        - On test data, low C often results in better generalization.

Impact of C on Generalization
- High C: The model prioritizes accuracy on the training data but risks overfitting due to a narrow margin.
- Low C: The model sacrifices some accuracy on the training data but generalizes better due to a wider margin.
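
The relationship between C and the margin can be measured directly for a linear kernel, where the margin width is $2 / ||w||$. The sketch below is a minimal illustration: the moons dataset, the three C values, and the variable names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X_c, y_c = make_moons(n_samples=200, noise=0.25, random_state=0)

margin_widths = {}
train_accuracy = {}
for C_value in [0.01, 1, 100]:
    model = SVC(kernel="linear", C=C_value).fit(X_c, y_c)
    w = model.coef_[0]                                  # weight vector (linear kernel only)
    margin_widths[C_value] = 2.0 / np.linalg.norm(w)    # margin width = 2 / ||w||
    train_accuracy[C_value] = model.score(X_c, y_c)

for C_value in margin_widths:
    print(f"C={C_value:<6} margin width={margin_widths[C_value]:.3f} "
          f"train accuracy={train_accuracy[C_value]:.3f}")
```

The printed widths shrink as C increases, which is the "narrower pathway" described above. Note that `coef_` is only available for the linear kernel; with an RBF kernel the margin lives in the implicit feature space and cannot be read off this way.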

##### **Dual Formulation of the Soft Margin Objective**
The objective function for soft margin classification in Support Vector Machines (SVMs) allows for some misclassification or margin violations in the dataset. 
- This makes the model more robust to noisy data and to data that is not linearly separable.

The dual formulation is more computationally efficient for many datasets, especially when using kernels. It is expressed as:

$$ \max_{\alpha} \sum^{N}_{i = 1} \alpha_i - \frac{1}{2} \sum^{N}_{i = 1} \sum^{N}_{j = 1} \alpha_i \alpha_j y_i y_j K(x_i, x_j) $$

Subject to:

$$ 0 \leq \alpha_i \leq C, \quad \sum^{N}_{i = 1} \alpha_i  y_i = 0 $$

Where: 
- $\alpha_i$: Lagrange multipliers.
- $K(x_i, x_j)$: Kernel function, used for non-linear decision boundaries.
- C: Controls the range of $\alpha_i$, balancing margin width and classification error.
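
These dual constraints can be inspected on a fitted scikit-learn model, whose `dual_coef_` attribute stores $\alpha_i y_i$ for the support vectors. The sketch below assumes an RBF kernel with `gamma=1.0` and `C=1.0` on the moons data; it also rebuilds the decision function from the dual coefficients as a consistency check.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X_d, y_d = make_moons(n_samples=200, noise=0.2, random_state=0)
C_dual, gamma_dual = 1.0, 1.0
rbf_svm = SVC(kernel="rbf", C=C_dual, gamma=gamma_dual).fit(X_d, y_d)

# dual_coef_ holds alpha_i * y_i for the support vectors only
signed_alphas = rbf_svm.dual_coef_.ravel()
alphas = np.abs(signed_alphas)

print("Number of support vectors:", alphas.size)
print("Largest alpha:", alphas.max(), "(bounded by C =", C_dual, ")")
print("Sum of alpha_i * y_i:", signed_alphas.sum())

# Decision function rebuilt from the dual form: sum_i alpha_i y_i K(x_i, x) + b
K = rbf_kernel(X_d, rbf_svm.support_vectors_, gamma=gamma_dual)
manual_decision = K @ signed_alphas + rbf_svm.intercept_[0]
print("Matches decision_function:",
      np.allclose(manual_decision, rbf_svm.decision_function(X_d)))
```

Only the support vectors have non-zero $\alpha_i$, which is why prediction needs kernel evaluations against those points alone.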

##### **Components in Relation to m Features**
The objective function in SVM for m-features balances the goals of maximizing the margin and minimizing misclassification through the weight vector w, bias b, and slack variables $\xi_i$. Regularization, via C, plays a key role in ensuring that the model generalizes well to unseen data.

- w: Weight vector of dimension m, one weight per feature, defines the hyperplane's orientation.
- $x_i \in R^m$: Feature vectors in the m-dimensional space.
- Kernel $K(x_i, x_j)$: Allows mapping of $x_i$ into a higher-dimensional feature space for non-linear separability, indirectly involving m.

##### Intuition for m Features
- The dimension m dictates the complexity of the weight vector w, which defines the separating hyperplane.
- Larger m means a higher-dimensional feature space, potentially increasing the model's capacity but also the risk of overfitting.
- Regularization (C) ensures that the optimization remains robust, even with a large number of features.

### **Optimization**
- To solve the SVM objective, quadratic programming methods or optimization algorithms (e.g., SMO—Sequential Minimal Optimization) are used. 
- For large datasets, linear solvers or approximate methods (e.g., kernel approximation) are often applied.

##### **Tuning an SVM model**
Use `sklearn`'s [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). 
- This procedure allows us to specify a set of possible parameters for a specific model.
    - `GridSearchCV` will then go through those parameters and try every possible combination of them (kind of like it's working through a grid in a systematic way - that's where the name comes from). 
    - `GridSearchCV` will then return the combination of parameters that resulted in a model with the best score. 
    - `GridSearchCV` makes use of **cross validation**, helping to ensure the robustness of its results.

Grid search is a systematic method for hyperparameter optimization that evaluates a predefined set of hyperparameters for a machine learning model, such as an SVM.

**Hyperparameters**

C: Regularization parameter.
- Higher C: Focuses on minimizing classification errors (narrower margin, higher risk of overfitting).
- Lower C: Allows more classification errors (wider margin, higher risk of underfitting).

Kernel: Specifies the kernel function.
- Linear: Best for linearly separable data.
- Polynomial/RBF (Radial Basis Function): Handles nonlinear decision boundaries.

Gamma: Used with RBF and polynomial kernels.
- Controls the influence of a single training example.
    - Lower values: More generalized decision boundaries.
    - Higher values: Tighter fit around data points.

Degree: Relevant for polynomial kernels, representing the polynomial degree.

Steps:

1. Create a dictionary that contains the parameters you want to tune as `keys` and all the different options you want to test for those parameters as `values`.

In [None]:
parameters = {'kernel':('linear', 'rbf'), 
              'C':(0.25,1.0),
              'gamma': (1,2)}

2. Instantiate an SVC classifier and tell `GridSearchCV` to test it using the parameters we previously specified:

In [None]:
svm = SVC()
clf = GridSearchCV(svm, parameters)
clf.fit(X_train,y_train)

**Understanding the Output of Grid Search for SVM**

1. Best Parameters (best_params_)

This indicates the combination of hyperparameters that resulted in the best cross-validation score.

    - {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
        - A regularization strength of C=10.
        - A kernel function of Radial Basis Function (RBF).
        - Gamma value of 0.1.

Extract the Best Parameters:
- Use grid_search.best_params_ to identify the best-performing combination.

2. Best Score (best_score_)

This is the highest cross-validation score achieved for the best parameter combination. It indicates how well the model generalized during validation.

    - 0.93
    - The best parameter combination resulted in 93% accuracy during cross-validation.

Examine the Best Score:
- Use grid_search.best_score_ to see the best validation accuracy achieved (e.g., 0.93).

Make Predictions:
- Use the grid_search.best_estimator_ to make predictions on new data.

3. Complete Results (cv_results_)

- A dictionary containing detailed results for all parameter combinations evaluated during grid search. Key fields:
    - mean_test_score: Average cross-validation score for each parameter set.
    - std_test_score: Standard deviation of scores across folds (indicates variability).
    - params: Parameter combinations corresponding to the scores.

            - {'mean_test_score': [0.91, 0.93, 0.89],
            'std_test_score': [0.01, 0.02, 0.03],
            'params': [{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'},
                        {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'},
                        {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}]}

            - The best score (0.93) corresponds to {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
            - The variability of scores (e.g., 0.02) reflects the model's consistency during cross-validation.

4.  Best Estimator (best_estimator_)

The trained SVM model with the best parameters. This can be used for predictions.

    - SVC(C=10, gamma=0.1, kernel='rbf')

##### How to Use the Grid Search Results
- Best Parameters: Use `grid_search.best_params_` to train the final SVM model on the full training data for optimal performance.
- Best Estimator: Use `grid_search.best_estimator_` directly for prediction.
- Scoring and Ranking: The `mean_test_score` field in `cv_results_` can be used to evaluate how different parameter combinations perform.
- Variability: Use `std_test_score` to assess how consistent the model performance is across folds. Lower variability indicates a robust model.
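
The attributes described above can be inspected as in the following sketch. It assumes scikit-learn is installed and uses a synthetic dataset from `make_classification` as a stand-in for the real training data; the names `X_demo` and `y_demo` and the dataset size are illustrative assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"kernel": ("linear", "rbf"), "C": (0.25, 1.0), "gamma": (1, 2)}
grid_search = GridSearchCV(SVC(), param_grid)  # cv=None -> 5-fold by default
grid_search.fit(X_demo, y_demo)

print("Best parameters:", grid_search.best_params_)
print("Best CV score:", round(grid_search.best_score_, 3))

# One entry per parameter combination: mean and spread of the fold scores.
results = grid_search.cv_results_
for params, mean, std in zip(results["params"],
                             results["mean_test_score"],
                             results["std_test_score"]):
    print(params, f"mean={mean:.3f}", f"std={std:.3f}")

# With refit=True (the default), best_estimator_ is already trained on all of X_demo.
predictions = grid_search.best_estimator_.predict(X_demo[:5])
```

Because `refit=True`, there is usually no need to retrain manually; `grid_search.predict(...)` delegates to the same refitted estimator.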

**GridSearch Output**

    - GridSearchCV(cv=None, error_score=nan,
                estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                            class_weight=None, coef0=0.0,
                            decision_function_shape='ovr', degree=3,
                            gamma='scale', kernel='rbf', max_iter=-1,
                            probability=False, random_state=None, shrinking=True,
                            tol=0.001, verbose=False),
                iid='deprecated', n_jobs=None,
                param_grid={'C': (0.25, 1.0), 'gamma': (1, 2),
                            'kernel': ('linear', 'rbf')},
                pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
                scoring=None, verbose=0)

1. cv=None:
- By default, cv=None uses 5-fold cross-validation to evaluate the performance of each hyperparameter combination.

2. error_score=nan:
- Specifies what happens if a model fails during training. 
    - If set to nan, the model skips that combination and assigns a score of nan.

3. estimator=SVC(...):
- The base model being optimized, in this case, an SVM classifier (SVC).
- The parameters within the SVC object (e.g., C=1.0, kernel='rbf') represent its default settings, which may be overridden by the grid search.

4. param_grid={'C': (0.25, 1.0), 'gamma': (1, 2), 'kernel': ('linear', 'rbf')}:
- The hyperparameter combinations being evaluated:
    - C: Regularization parameter values [0.25,1.0].
    - gamma: Kernel coefficient values [1,2].
    - kernel: Kernel types [linear ,  rbf].
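- The grid search will test all possible combinations of these parameters (a total of 2×2×2=8 combinations). Note that `gamma` is ignored by the linear kernel, so linear-kernel combinations that differ only in `gamma` are equivalent.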

5. iid='deprecated':
- Refers to the Independent Identically Distributed assumption. This parameter was deprecated in Scikit-learn 0.22 and removed in 0.24. 
    - It is safe to ignore this unless you’re using an older version of Scikit-learn.

6. n_jobs=None:
- Specifies the number of CPU cores to use for parallel computation. None means it will run in serial mode on a single core.

7. pre_dispatch='2*n_jobs':
- Controls the number of jobs that get dispatched during parallel computation. 
    - Since n_jobs=None, this has no effect.

8. refit=True:
- After finding the best hyperparameter combination, the grid search automatically refits the model on the entire training dataset using those parameters.

9. return_train_score=False:
- If True, the results would include training scores in addition to validation scores. 
- Here, it is False, so only validation scores are calculated.

10. scoring=None:
- Indicates that the default scoring metric for the estimator (e.g., accuracy for classification) is used.

**This Configuration Means**

The grid search is tuning an SVM classifier with:
- Two values of C [0.25,1.0],
- Two values of gamma [1,2], and
- Two kernel types (linear and rbf).

Each of these 2×2×2=8 combinations is evaluated using 5-fold cross-validation.

The performance of each combination is assessed using the default scoring metric (accuracy for classification).

The best-performing combination is automatically selected and refitted on the entire training dataset.
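
The amount of work this configuration implies can be counted directly. The sketch below uses only the standard library to enumerate the grid; the variable names are illustrative, and the fold count of 5 is the `cv=None` default described above.

```python
from itertools import product

param_grid = {"C": (0.25, 1.0), "gamma": (1, 2), "kernel": ("linear", "rbf")}
n_folds = 5  # default used when cv=None

# Every combination of one value per hyperparameter.
names = sorted(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

cv_fits = len(combinations) * n_folds   # models trained during cross-validation
total_fits = cv_fits + 1                # plus the final refit (refit=True)

print(len(combinations), "combinations,", cv_fits, "cross-validation fits,",
      total_fits, "fits in total")
```

The count grows multiplicatively with each added hyperparameter value, which is why grids are usually kept small.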

### **Reasons for not using squared loss function in classification problems**

1. It is sensitive to outliers.
2. It does not align with the probabilistic interpretation of classification tasks.
3. It fails to emphasize the separation of classes effectively.
4. Alternatives like cross-entropy or hinge loss are better suited for optimizing classification models, focusing on class separation and meaningful probabilities.

Non-robustness to Outliers
- In Classification: Misclassified points, especially outliers, can disproportionately influence the decision boundary, leading to poor generalization.
- Squared loss penalizes large errors quadratically, 
    - meaning that a few instances with large prediction errors can dominate the loss function.

Misalignment with Classification Goals
- Nature of Classification: Classification problems aim to predict discrete labels or probabilities for class membership, focusing on correctly separating classes.
- Squared Loss Behavior: Squared loss minimizes the difference between predicted and true values. 
    - In classification, true labels are usually encoded as 0 or 1, and predictions outside [0,1] are meaningless probabilities. 
        - This can result in illogical outcomes for probabilities and suboptimal boundaries.

Poor Handling of Probabilities
- Probabilistic Interpretation: Classification models often interpret predictions as probabilities of class membership.
- Squared Loss Issues: It does not naturally account for the probabilistic nature of classification. 
    - Loss functions such as log-loss (cross-entropy) directly optimize probability estimates, so predictions align better with actual class probabilities.

Inappropriate Gradients for Classification
- Gradient Shape: The gradient of squared loss is linear in the error, so it grows in proportion to the distance between the prediction and the label.
- Impact: A point that is classified correctly but whose score lies far from its label still produces a large gradient, leading to inefficient updates. 
    - Loss functions like hinge loss or cross-entropy loss prioritize the misclassified or uncertain points more effectively, which aligns with the goal of improving class separation.

Squared Loss Leads to Non-optimal Decision Boundaries
- Decision Boundary Nature: In classification, the goal is to maximize the margin between classes or ensure good separation.
- Squared Loss Focus: By trying to minimize the distance between predicted and true labels, squared loss tends to favor a compromise boundary, potentially leading to poorly separated classes, especially in non-linear classification problems.

Better Alternatives Exist
- Hinge Loss: Used in SVMs, it focuses on maximizing the margin and ensures only points near or across the boundary contribute to the loss.
- Cross-Entropy Loss: Used in logistic regression and neural networks, it optimizes probabilities directly, aligning well with classification goals.
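
The difference between these losses can be seen by evaluating them on the margin $m = y \cdot f(x)$ for labels $y \in \{-1, +1\}$. This is a minimal sketch under that label encoding (an assumption; the text above also mentions 0/1 encoding), with function names chosen for illustration.

```python
import math

def squared_loss(margin):
    # (y - f)^2 equals (1 - y*f)^2 when y is +1 or -1
    return (1.0 - margin) ** 2

def hinge_loss(margin):
    # Zero once the point is on the correct side with margin >= 1
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    # Cross-entropy written in terms of the margin
    return math.log(1.0 + math.exp(-margin))

for margin in (-2.0, 0.0, 1.0, 3.0):
    print(f"margin={margin:+.1f}  squared={squared_loss(margin):.3f}  "
          f"hinge={hinge_loss(margin):.3f}  logistic={logistic_loss(margin):.3f}")
```

At a margin of 3 (a confident, correct prediction) the squared loss is 4, larger than its value of 1 at the decision boundary, while hinge loss is zero and logistic loss is close to zero. This is the sense in which squared loss penalizes points that are already well classified.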


### Advantages of SVMs
1. Effective in High-Dimensional Spaces
-  SVMs perform well when the number of features is large relative to the number of observations.
- Example: Applications in text classification or genomics, where the feature space is often very high-dimensional.

2. Works Well with Clear Margins of Separation
- SVM aims to find the optimal hyperplane that maximizes the margin between classes, which ensures robust classification when classes are well-separated.
- Example: Binary classification tasks where the data is linearly separable.

3. Kernel Trick for Non-linear Data
- SVM uses the "kernel trick" to map non-linearly separable data into a higher-dimensional space where a linear separation is possible.
- Example: Radial Basis Function (RBF) and polynomial kernels can handle complex decision boundaries.

4. Regularization Through C Parameter
- The regularization parameter C controls the trade-off between maximizing the margin and minimizing classification errors, making SVMs flexible to different types of data distributions.
- Example: Adjusting C to avoid overfitting on small datasets or noisy data.

5. Robust to Overfitting (with Proper Tuning)
- By controlling the margin size and kernel functions, SVMs can generalize well, especially for small datasets.
- Example: SVMs perform better than other models when data has limited examples but a high feature count.

6. Effective for Outlier Detection
- SVM variants, such as one-class SVM, are used to detect anomalies by learning the boundaries of the majority class.
- Example: Fraud detection or network intrusion detection.

### Disadvantages of SVMs
1. High Computational Cost
- SVM training involves solving a convex optimization problem, which can become computationally expensive for large datasets.
- Example: For datasets with millions of samples, training can be significantly slower compared to models like logistic regression or decision trees.

2. Sensitive to Choice of Kernel
- The performance of SVM heavily depends on the choice of kernel function and its parameters (e.g., RBF kernel with parameters $\gamma$ and C).
- Example: Incorrect kernel choice may lead to poor performance or overfitting.

3. Inefficient for Large Datasets
- The training complexity of SVMs scales between $O(n^2)$ and $O(n^3)$ in the number of samples $n$, making them less suitable for massive datasets.
- Example: SVM may struggle with datasets containing millions of instances compared to neural networks or gradient-boosted trees.

4. Difficulty Handling Noisy Data
- SVMs try to maximize the margin and are sensitive to mislabeled data points, which can shift the decision boundary significantly.
- Example: In datasets with a high degree of label noise, SVMs may underperform compared to models with robust loss functions.

5. Lack of Probabilistic Output
- SVMs do not naturally provide probabilities for predictions. While this can be approximated using Platt scaling or cross-validation, the results are not as interpretable as probabilistic models.
- Example: Logistic regression offers direct probabilities, which are more useful in some applications, like medical diagnosis.

6. Hyperparameter Tuning is Non-trivial
- Choosing the right values for hyperparameters like C, $\gamma$, and the kernel function often requires extensive grid search or cross-validation.
- Example: Poorly tuned parameters can lead to overfitting or underfitting, requiring careful experimentation.

7. Not Easily Scalable for Multiclass Problems
- SVMs are inherently binary classifiers. 
- For multiclass classification, strategies like 
    - one-vs-rest (OVR) or 
    - one-vs-one (OVO) must be used, adding complexity and computational cost.
- Example: For 10 classes, OVO requires 10×(10−1)/2=45 classifiers to be trained.
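
The classifier counts for the two strategies can be computed as follows. This is a small sketch for problems with more than two classes; the function names are illustrative.

```python
def ovo_classifier_count(n_classes):
    # One binary classifier for every unordered pair of classes.
    return n_classes * (n_classes - 1) // 2

def ovr_classifier_count(n_classes):
    # One binary classifier per class (that class against all others).
    return n_classes

for k in (3, 10, 100):
    print(k, "classes:", ovr_classifier_count(k), "OVR vs",
          ovo_classifier_count(k), "OVO")
```

OVO grows quadratically in the number of classes, although each of its classifiers is trained on only the samples from two classes.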

When to Use SVMs
- Best Use Cases:
    - High-dimensional datasets with clear margins of separation.
    - Small-to-medium-sized datasets with complex decision boundaries.
    - Applications where interpretability of the decision boundary is important (e.g., feature weights in a linear kernel).
- Not Ideal For:
    - Large datasets due to computational cost.
    - Noisy datasets where robust models like random forests or neural networks might outperform.
    - Problems requiring probabilistic outputs or interpretable probabilities.

##### Common use cases that require probabilistic outputs or interpretable probabilities, where SVMs perform poorly

Key Challenges with SVMs in Probabilistic Scenarios
- Calibration Issues: SVM probabilities (from methods like Platt Scaling) are often less reliable than probabilities from inherently probabilistic models.
- Interpretability: Decision boundaries and margins are not intuitive for users who need to interpret confidence levels.
- Actionable Insights: Many use cases (e.g., credit scoring, fraud detection) require actionable thresholds or prioritization, which hinge on well-calibrated probabilities.

Medical Diagnosis
- Why Probabilities are Needed: In medical applications, probabilistic outputs help determine the likelihood of a disease or condition, allowing practitioners to weigh risks and make informed decisions.
    - Example: Predicting whether a patient has cancer with a 90% probability versus 55%.
- Why SVM Fails: SVM outputs are distances from the decision boundary, which don’t naturally translate to probabilities. While calibration techniques like Platt Scaling can convert these into probabilities, they often yield less reliable and less interpretable probabilities than models like logistic regression.

Fraud Detection
- Why Probabilities are Needed: In fraud detection, probabilities allow for setting thresholds based on the acceptable level of risk. For instance, transactions with a probability of fraud >95% may trigger an immediate block, while transactions with 60%-80% may require manual review.
    - Example: Flagging fraudulent transactions on an e-commerce platform.
- Why SVM Fails: SVMs don’t inherently provide probabilities for these thresholds, making it difficult to prioritize actions based on the confidence level of predictions. This lack of interpretability can lead to either overreaction (blocking too many transactions) or underreaction.

Customer Churn Prediction
- Why Probabilities are Needed: Businesses use churn probability to allocate resources effectively, targeting high-probability churners with retention offers. Probabilities help prioritize interventions.
    - Example: Predicting that a customer has a 70% chance of leaving allows the company to offer personalized discounts or incentives.
- Why SVM Fails: SVM’s non-probabilistic nature makes it hard to prioritize customers effectively. In contrast, logistic regression or gradient boosting models provide reliable churn probabilities, directly guiding resource allocation.

Marketing Campaign Effectiveness
- Why Probabilities are Needed: Campaign optimization often relies on the likelihood of conversion or engagement. For example, targeting customers with an 80% chance of responding to an ad is more efficient than targeting those with only 20%.
    - Example: Predicting the probability that a customer will click on an ad or make a purchase.
- Why SVM Fails: SVM outputs distances, not probabilities, making it harder to assign confidence levels to predictions. This lack of probabilistic output complicates the ranking of prospects for targeted campaigns.
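
To make the calibration step mentioned above concrete, the sketch below applies the Platt scaling formula $P(y=1 \mid f) = 1 / (1 + e^{A f + B})$ to a few SVM decision values. The coefficients `A` and `B` here are hypothetical; in practice they are fitted on held-out data, and the quality of the resulting probabilities depends on that fit.

```python
import math

def platt_probability(decision_value, a, b):
    # Platt scaling: a sigmoid applied to an affine function of the SVM score.
    return 1.0 / (1.0 + math.exp(a * decision_value + b))

# Hypothetical coefficients; in practice A and B are fitted on held-out data.
A, B = -2.0, 0.0

for score in (-1.5, 0.0, 0.4, 2.0):
    print(f"decision value {score:+.1f} -> estimated P(positive) = "
          f"{platt_probability(score, A, B):.3f}")
```

The mapping preserves the ranking of the decision values, so any unreliability comes from how well the fitted sigmoid matches the true relationship between distance and class frequency.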

### **Support Vector Regression (SVR)**
In Support Vector Regression (SVR), the hyperplane represents the regression function.
- Unlike classification, where a single hyperplane separates data into classes, SVR defines a central hyperplane and two boundary hyperplanes to create a margin of tolerance around the predictions. 
    - These boundaries are governed by the $\epsilon$-insensitive loss function.

##### **Equations for Hyperplanes in SVR**
Let $f(x) = < w, x > + b$ be the regression function, where:
- w is the weight vector.
- b is the bias term.
- x is the input feature vector.
- ⟨w,x⟩ denotes the dot product between w and x.

1. Central Hyperplane (Regression Function):
- The primary regression function is:
$$f(x) = < w, x > + b$$
- This hyperplane represents the predicted values for given inputs.

2. Boundary Hyperplanes (ε-margin):
- Two parallel hyperplanes are defined around the central hyperplane to create the $\epsilon$-margin:
- Upper boundary:
$$y = < w, x > + b + \epsilon$$
- Lower boundary:
$$y = < w, x > + b - \epsilon$$
- Data points within this margin (between the upper and lower boundaries) are considered error-free.

3. Slack Variable Constraints ($\xi$ and $\xi^*$):
- Slack variables $\xi_i$ and $\xi^*_i$ are introduced to allow violations of the margin:
- For data points above the upper boundary:
$$ y_i \leq < w, x_i > + b + \epsilon + \xi_i $$
- For data points below the lower boundary:
$$ y_i \geq < w, x_i > + b - \epsilon - \xi^*_i $$

##### **Objective Function in SVR:**
- To find the optimal hyperplane, SVR minimizes the objective function:

$$ \min_{w,b,\xi,\xi^*} \; \frac{1}{2} ||w||^2 + C \sum^n_{i=1} (\xi_i + \xi_i^*)$$

- Subject to:
$$ y_i - < w, x_i > - b \leq \epsilon + \xi_i, $$
$$ < w, x_i > + b - y_i \leq \epsilon + \xi^*_i, $$
$$ \xi_i , \xi^*_i \geq 0. $$

- Where:
    - Model complexity term: $\frac{1}{2}||w||^2$ acts as the regularization term and keeps the model simple.
        - $||w||^2$ is the squared norm of the weight vector w, which controls the flatness (or smoothness) of the regression function $f(x) = < w, x > + b$.
        - Minimizing $||w||^2$ reduces overfitting by penalizing large coefficients, which would otherwise make the model too sensitive to small variations in the data.
    - Margin violation term: $C \sum^n_{i=1} (\xi_i + \xi_i^*)$ is the weighted total error outside the $\epsilon$-margin.
        - The slack variables $\xi_i$ and $\xi_i^*$ capture the extent to which each data point falls outside the ε-insensitive margin.
        - C (regularization parameter) controls the trade-off between minimizing margin violations and achieving a simpler model.
            - Larger C: Penalizes violations more heavily, leading to a model that fits the data more closely (risking overfitting).
            - Smaller C: Allows more violations, leading to a flatter, simpler, more generalized model.
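
The two terms of the objective can be evaluated by hand for a candidate linear model. The sketch below does only that: it scores a given `w` and `b` and does not perform the optimization itself. The toy data and the helper name `svr_objective` are illustrative assumptions.

```python
def svr_objective(w, b, X, y, C, epsilon):
    """Evaluate the primal SVR objective for a linear model f(x) = <w, x> + b."""
    def predict(x):
        return sum(wj * xj for wj, xj in zip(w, x)) + b

    # For fixed w and b, the smallest feasible xi_i + xi_i^* is the amount by
    # which the residual exceeds epsilon (zero for points inside the margin).
    slack_total = sum(max(0.0, abs(yi - predict(xi)) - epsilon)
                      for xi, yi in zip(X, y))
    complexity = 0.5 * sum(wj * wj for wj in w)
    return complexity + C * slack_total, complexity, slack_total

# Toy one-feature data (illustrative values only).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.1, 1.0, 2.6, 2.9]

total, complexity, slack = svr_objective(w=[1.0], b=0.0, X=X, y=y, C=1.0, epsilon=0.2)
print(f"complexity={complexity:.2f}  slack={slack:.2f}  objective={total:.2f}")
```

Only the third point lies outside the margin here, so it alone contributes slack; raising `C` would increase the weight of that violation relative to the complexity term.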
 
Interaction Between the Terms

1. Trade-off Between Complexity and Error:
- The first term $\frac{1}{2}||w||^2$ minimizes the model complexity (a smaller $||w||^2$ gives a flatter, smoother regression function).
- The second term $C \sum^n_{i=1} (\xi_i + \xi_i^*)$ penalizes errors, pushing the model to fit the data better.
- A balance between these two objectives ensures that the model is both simple and accurate.

2. Effect of C on Interaction:
- High C:
    - The penalty for margin violations dominates.
    - The optimization prioritizes minimizing errors $(\xi_i + \xi_i^*)$.
    - This leads to a model that fits the training data very closely but might overfit.
- Low C:
    - The penalty for margin violations is relaxed.
    - The optimization focuses more on minimizing $||w||^2$, resulting in a smoother, generalized model that may tolerate some errors.

3. Balancing Goals:
- The interaction of these terms ensures that the regression hyperplane is not just fitting the data but also remains simple and interpretable.
- The hyperparameter C serves as a balancing lever, enabling the model to adjust to the desired trade-off between 
    - fitting the data (error minimization) and 
    - avoiding overfitting (complexity minimization).

##### Key Insights on the Hyperplanes:
Role of Hyperplanes:
- The central hyperplane represents the predicted regression function.
- The boundary hyperplanes define the tolerance region (ε-margin) where predictions are considered acceptable.

Effect of Slack Variables:
- Points outside the ε-margin introduce error, measured by the slack variables $\xi_i$ and $\xi^*_i$ .

Geometric Interpretation:
- The hyperplanes adapt to the data, minimizing the margin violations while maintaining simplicity (through $||w||^2$).

Role of Hyperparameters:
- C: Controls the trade-off between margin violations (slack variables) and model complexity ($||w||^2$).
- $\epsilon$: Defines the width of the margin.

##### **Error Calculation in SVR**
In Support Vector Regression (SVR), the error is calculated based on the ε-insensitive loss function. 
- This loss function defines a margin of tolerance (epsilon ($\epsilon$)) around the true target value, within which predictions are considered acceptable and do not contribute to the error. 
- SVR aims to find a regression function that minimizes error while maintaining a margin of tolerance.

Epsilon-insensitive loss function:
- The loss is calculated only for predictions that fall outside the epsilon margin.
- For a given prediction $f(x_i)$ and true target $y_i$, the loss is:

$$ L(f(x_i), y_i) = \begin{cases} 0 & \text{if } |f(x_i) - y_i| \leq \epsilon, \\ |f(x_i) - y_i| - \epsilon & \text{otherwise.} \end{cases} $$

- If the predicted value is within $\epsilon$ of the true value, no error is incurred. Otherwise, the error is proportional to the distance outside the margin.

Slack variables ($\xi$ and $\xi^*$)
- SVR introduces slack variables to handle violations of the margin on either side of the regression function:
    - $\xi_i$: Measures the amount by which the true target $y_i$ lies above the upper boundary $f(x_i) + \epsilon$ (equivalently, the prediction falls short of $y_i - \epsilon$).
    - $\xi_i^*$: Measures the amount by which the true target $y_i$ lies below the lower boundary $f(x_i) - \epsilon$ (equivalently, the prediction exceeds $y_i + \epsilon$).
- Total error is the sum of these slack variables.

Objective Function in SVR:
- The SVR optimization problem minimizes both:
    - A regularization term $||w||^2$ to keep the model general.
    - The sum of slack variables (errors) to account for predictions outside the margin.

$$ \min_{w,b,\xi,\xi^*} \; \frac{1}{2} ||w||^2 + C \sum^n_{i=1} (\xi_i + \xi_i^*)$$
- Subject to:
$$ y_i - < w, x_i > - b \leq \epsilon + \xi_i, $$
$$ < w, x_i > + b - y_i \leq \epsilon + \xi^*_i, $$
$$ \xi_i , \xi^*_i \geq 0. $$
- Where:
    - $||w||^2$: Regularization term, controlling model complexity.
    - C: Regularization parameter, balancing margin violations and model simplicity.
    - $\sum^n_{i=1} (\xi_i + \xi_i^*)$: Total error outside the $\epsilon$-margin.

##### Key Parameters Influencing Error:
Epsilon ($\epsilon$):
- Larger $\epsilon$: Allows more tolerance, leading to fewer points outside the margin (lower total error but less sensitivity to small variations in data).
- Smaller $\epsilon$: Reduces tolerance, increasing the sensitivity of the model to small deviations but possibly increasing total error.
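
The effect of $\epsilon$ on the measured error can be illustrated on a fixed set of residuals. The residual values below are hypothetical, and the sketch holds the predictions fixed; in a real fit, changing $\epsilon$ would also change the fitted function.

```python
residuals = [0.05, -0.12, 0.30, -0.45, 0.08, 0.22]  # hypothetical y_i - f(x_i)

def epsilon_insensitive_total(residuals, epsilon):
    # Loss is zero inside the margin and grows linearly outside it.
    losses = [max(0.0, abs(r) - epsilon) for r in residuals]
    outside = sum(1 for loss in losses if loss > 0)
    return sum(losses), outside

for epsilon in (0.0, 0.1, 0.25, 0.5):
    total, outside = epsilon_insensitive_total(residuals, epsilon)
    print(f"epsilon={epsilon:.2f}: {outside} points outside the margin, "
          f"total error={total:.2f}")
```

As $\epsilon$ grows, fewer points fall outside the margin and the total error shrinks, reaching zero once every residual is within tolerance.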

Regularization Parameter (C):
- Large C: Penalizes errors more heavily, leading to smaller total error but less tolerance for deviations.
- Small C: Penalizes errors less, allowing more tolerance for margin violations.

##### SVR Error Calculation

In [None]:
from sklearn.svm import SVR
import numpy as np
import matplotlib.pyplot as plt

# Generate sample data
np.random.seed(42)
X = np.sort(5 * np.random.rand(100, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(scale=0.2, size=X.shape[0])

# Train SVR with RBF kernel
epsilon = 0.1
model = SVR(kernel='rbf', C=1.0, epsilon=epsilon)
model.fit(X, y)

# Predictions
y_pred = model.predict(X)

# Calculate error
error = np.maximum(0, np.abs(y_pred - y) - epsilon)

# Plot data
plt.scatter(X, y, color='blue', label='True data')
plt.plot(X, y_pred, color='red', label='SVR prediction')
plt.fill_between(X.ravel(), y_pred - epsilon, y_pred + epsilon, color='gray', alpha=0.3, label='Epsilon margin')
plt.legend()
plt.title('SVR with RBF Kernel')
plt.xlabel('X')
plt.ylabel('y')
plt.show()

# Print total error
print(f"Total error (sum of slack variables): {np.sum(error):.4f}")


# Bayesian Models
What It Means: 
- Bayesian models incorporate prior knowledge or beliefs with the data to update the probability of outcomes as new evidence is available.

Outcome Interpretation: 
- Each output is a probability distribution reflecting both prior knowledge and the new data, offering a range of likely outcomes.

Performance Measures:
- Log-Likelihood: Measures how well the model explains the data; higher values indicate better fit.

Lay Explanation: 
- Bayesian models are like revising a guess based on new evidence—updating beliefs as we get more information.
- After you have seen all the evidence, how much do you believe something? The Bayesian view assumes that you do not simply believe or disbelieve something; you must assign a degree of belief. It is about calibrating the degree of belief to the strength of the evidence.
- The degree of belief in a hypothesis should be determined by how likely the hypothesis is beforehand (before looking at the evidence), multiplied by the odds that you would see this evidence if the hypothesis were true, and scaled by how common that evidence is across the board, whether the hypothesis is true or false.

The benefits of Naive Bayes are that the model is simple to build and is useful on large data sets. Further, the model makes an explicit assumption that the features are independent given the class label. What does this mean? Well first let's consider the concept of independence. Independence is a concept from probability theory and it implies that if we have two random variables $X$ and $Y$, then

$$
P(X \cap Y) = P(X)P(Y)
$$

This is where the qualifier "Naive" in "Naive Bayes" comes from. The assumption is naive because it often does not hold in practice. It means the model treats the features as unrelated to one another once the class label is known. Hence, the joint probability distribution $P(X, Y)$ can be obtained from the marginal probability distributions $P(X)$ and $P(Y)$ simply by multiplication. We will use the above independence assumption, conditional probability rules, and Bayes' theorem to develop some theory for how the Naive Bayes model works.

$$ P(A|B) = \frac{P(B|A) \times P(A)}{P(B)} $$

$$ \text{Posterior Probability} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}} $$

- Likelihood (P(B∣A)): If the hypothesis is true, how likely is it that you would see the evidence you are now seeing?
    - The probability of observing event B given that A is true.
    - Represents how likely the evidence is under the hypothesis.
- Prior Probability (P(A)) is based on everything we know so far:
    - The initial probability of event A before observing any evidence.
    - Represents our prior knowledge or belief about A, drawn from sources such as:
        - Historic Data
        - Established Theories
- Marginal Probability (P(B)), the evidence: How common the data is (how often you expect to see that evidence across the board, whether the idea you are testing is true or false).
    - The total probability of observing event B, considering all possible hypotheses.
    - Acts as a normalizing constant to ensure the posterior probability sums to 1.
- Posterior Probability (P(A∣B)):
    - The updated probability of event A after observing event B.
    - Represents our revised belief about A given the evidence.

Use Case: 
- To incorporate prior knowledge and quantify uncertainty.
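
As a worked instance of Bayes' theorem as stated above, the sketch below computes a posterior for a diagnostic test. All numbers are hypothetical, and the evidence $P(B)$ is expanded with the law of total probability over "A is true" and "A is false".

```python
def posterior(prior, likelihood, likelihood_if_false):
    # Evidence P(B): total probability of B over A and not-A.
    evidence = likelihood * prior + likelihood_if_false * (1.0 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers for a diagnostic test.
prior = 0.01                 # P(A): 1% of people have the condition
likelihood = 0.95            # P(B|A): positive test when the condition is present
false_positive_rate = 0.05   # P(B|not A): positive test when it is absent

p = posterior(prior, likelihood, false_positive_rate)
print(f"P(condition | positive test) = {p:.3f}")
```

Even with a fairly accurate test, the posterior is only about 16% because the prior is small: most positive results come from the much larger group of people without the condition.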

Model Types: 
- Bayesian Linear Regression, 
- Bayesian Networks.

Application of Bayes' Theorem

Bayes' theorem is an important result in statistics and it allows us to obtain a posterior distribution given a prior distribution and a likelihood. Now that is a mouthful, but don't worry, we will walk you through it. First, let's denote class labels using $K = 1, \cdots, k$ - seem familiar? What we are trying to say here is that each class label corresponds to a number between $1$ and $k$. So, given this, the **independence assumption conditional on the class label is denoted**

$$
P(X_{1} \cap X_{2} \mid K = k) = P(X_{1} \mid K = k) P(X_{2} \mid K = k)
$$

### **Calculating Joint Probability: Independent vs Dependent Events**
Joint probability is a key concept for combining evidence and understanding relationships between events.

**Independent Events**
- Two events A and B are independent if the occurrence of one event does not affect the probability of the other event. In other words, knowing that one event has occurred gives no information about the other event.

Mathematical Condition
- For independent events:
$$ P(A|B) = P(A) \text{ and } P(B|A) = P(B) $$

**Joint Probability for Independent Events**
- For independent events A and B, the joint probability is the product of their individual probabilities:
$$ P(A \cap B) = P(A) \cdot P(B) $$
- Example:
    - The probability of rolling a 6 on a die (A) and flipping heads on a coin (B):
$$ P(A \cap B) = P(A) \cdot P(B) $$
$$ P(A \cap B) = \frac{1}{6} \cdot \frac{1}{2} $$
$$ P(A \cap B) = \frac{1}{12} $$

Why This Formula?
- Since the occurrence of A does not influence B (and vice versa), the probability of both events happening together is simply the product of their individual probabilities.
- This is a direct consequence of the definition of independence.

Interpretation:
- The events are independent because the outcome of the die roll does not affect the coin flip.
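
The independence of the die and the coin can also be checked by enumerating the whole sample space rather than multiplying the two probabilities. This sketch assumes a fair six-sided die and a fair coin, and uses exact fractions to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Sample space: every (die face, coin side) pair, all equally likely.
outcomes = list(product(range(1, 7), ("H", "T")))

def probability(event):
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))

p_six = probability(lambda o: o[0] == 6)
p_heads = probability(lambda o: o[1] == "H")
p_both = probability(lambda o: o[0] == 6 and o[1] == "H")

print(p_six, p_heads, p_both)
print("Independent:", p_both == p_six * p_heads)
```

Counting outcomes gives the joint probability directly, and comparing it with the product of the marginals is exactly the definition of independence.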

**Dependent Events**
- Two events A and B are dependent if the occurrence of one event affects the probability of the other event. In other words, knowing that one event has occurred provides information about the other event.

Mathematical Condition
- For dependent events:
$$ P(A|B) \neq P(A) \text{ and } P(B|A) \neq P(B) $$

**Joint Probability for Dependent Events**
- For dependent events A and B, the joint probability is calculated using conditional probability:
$$ P(A \cap B) = P(A|B) \cdot P(B) \text{ or } P(A \cap B) = P(B|A) \cdot P(A)  $$
- Example:
    - The probability of it raining (A) and being cloudy (B):
        - Suppose P(B) = 0.4 (probability of being cloudy).
        - Suppose P(A|B) = 0.7  (probability of rain given it’s cloudy).
    - then:
$$ P(A \cap B) = P(A|B) \cdot P(B) $$
$$ P(A \cap B) = 0.7 \cdot 0.4 $$
$$ P(A \cap B) = 0.28 $$

Why This Formula?
- Since the occurrence of A affects B (or vice versa), we cannot simply multiply their individual probabilities.
- Instead, we use the conditional probability P(A∣B) or P(B∣A) to account for the dependence between the events.

Interpretation:
- The probability of it being cloudy and raining is 0.28.
- The events are dependent because the probability of rain increases when it is cloudy.

Key Differences Between Independent and Dependent Events
|Aspect	|Independent Events	|Dependent Events|
|-------|-------------------|----------------|
|Definition	|Occurrence of one event does not affect the other.	|Occurrence of one event affects the other.|
|Condition	| $P(A \mid B) = P(A) \text{ and } P(B \mid A) = P(B)$	| $P(A \mid B) \neq P(A) \text{ and } P(B \mid A) \neq P(B)$ |
|Joint Probability|	$P(A \cap B) = P(A) \cdot P(B)$| $P(A \cap B) = P(A \mid B) \cdot P(B)$|
|Example	|Rolling a die and flipping a coin.|	Rain and cloudiness.|

Why Joint Probability is Calculated Differently

Independent Events:
- The occurrence of one event does not influence the other, so their probabilities can be multiplied directly.
- Example: Rolling a die and flipping a coin are unrelated events.

Dependent Events:
- The occurrence of one event affects the probability of the other, so we must account for this relationship using conditional probability.
- Example: Rain is more likely when it is cloudy, so the probability of rain depends on cloudiness.

In [None]:
# Function to calculate joint probability for independent events
def joint_probability_independent(P_A, P_B):
    return P_A * P_B

# Function to calculate joint probability for dependent events
def joint_probability_dependent(P_A_given_B, P_B):
    return P_A_given_B * P_B

# Example 1: Independent events
P_A = 1/6  # Probability of rolling a 6
P_B = 1/2  # Probability of flipping heads
P_A_and_B_independent = joint_probability_independent(P_A, P_B)
print(f"Joint Probability (Independent Events): {P_A_and_B_independent:.4f}")

# Example 2: Dependent events
P_B_cloudy = 0.4  # Probability of being cloudy
P_A_given_B_rain = 0.7  # Probability of rain given it's cloudy
P_A_and_B_dependent = joint_probability_dependent(P_A_given_B_rain, P_B_cloudy)
print(f"Joint Probability (Dependent Events): {P_A_and_B_dependent:.4f}")

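In the conditional independence expression given earlier, $P(X_{1} \cap X_{2} \mid K = k) = P(X_{1} \mid K = k) P(X_{2} \mid K = k)$, $X_{1}$ and $X_{2}$ denote random predictor variables. Keep this in mind. Bayes' theorem states that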

$$
P(K = k \mid X) = \frac{P(X \mid K= k) P(K = k)}{P(X)}
$$

In the expression above:

- $P(K = k \mid X)$ is called the _posterior probability_ distribution,

- $P(K = k)$ is the *prior*, and

- $P(X \mid K= k)$ is the _likelihood_.

The posterior tells us the probability of the class being $k$ given a particular observation. The prior measures how likely we think it is for any observation to belong to a particular class before we have seen its features. The likelihood measures how probable the observed features are if the observation belongs to class $k$.

Bayes' Theorem is used in many real-world applications, such as:
- Spam Filtering: Classifying emails as spam or not spam based on the presence of certain keywords.
- Medical Diagnosis: Determining the probability of a disease given a set of symptoms.
- Machine Learning: Building probabilistic models like Naive Bayes classifiers.

##### **General algorithm:**

**Step 1: For each class $k$ do:**
- Find the likelihood $P(X_{1} \cap X_{2} \mid K = k) = P(X_{1} \mid K = k)P(X_{2} \mid K = k)$, using only the observations where the class is $k$ in the data. 
- Compute a prior probability for the current class $k$ = $\frac{observations \space in \space class \space k}{total \space number \space of \space observations}$. 
- Use Bayes' theorem, with a denominator (i.e.: the evidence $P(X)$) of $1$, to compute the posterior probability distribution.
    
    
**Step 2: At test time** 
- Observations are assigned to the class with the highest posterior probability $P(K = k \mid X_{1} \cap X_{2})$.

**Generalise the model to include an arbitrary number of random predictor variables instead of 2, i.e.:**
$$ X_{1}, \cdots, X_{p} $$

In this case
$$ P(K = k \mid X_{1} \cap X_{2} \cdots \cap X_{p}) = \frac{P(X_{1} \cap X_{2} \cdots \cap X_{p} \mid K= k) P(K = k)}{P(X)}$$

Now remember the assumption in the back of your mind. This allows us to simplify the expression above to become
$$ P(K = k \mid X_{1} \cap X_{2} \cdots \cap X_{p}) = \frac{P(X_{1}\mid K= k) P(X_{2}\mid K= k) \cdots P(X_{p}\mid K= k) P(K = k)}{P(X)} $$

A special mathematical symbol allows us to represent the product $P(X_{1}\mid K= k) P(X_{2}\mid K= k) \cdots P(X_{p}\mid K= k)$ as $\prod_{i = 1}^{p} P(X_{i}\mid K= k)$. 
- Hence, the expression above becomes:
$$ P(K = k \mid X_{1} \cap X_{2} \cdots \cap X_{p}) = \frac{\prod_{i = 1}^{p} P(X_{i}\mid K= k)P(K = k)}{P(X)} $$

**Consider how the Bayes classifier assigns observations to a particular class** 
- One method, called [maximum a posteriori (MAP)](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation), maximises $P(K = k \mid X_{1} \cap X_{2} \cdots \cap X_{p})$ for $K = 1, \cdots, k$ to assign the observation to the correct class. 

Let's consider what this means for $K = 2$. In this case we calculate

$$
P(K = 1 \mid X_{1} \cap X_{2} \cdots \cap X_{p}) \\
P(K = 2 \mid X_{1} \cap X_{2} \cdots \cap X_{p}) \\
$$

If $P(K = 1 \mid X_{1} \cap X_{2} \cdots \cap X_{p}) > P(K = 2 \mid X_{1} \cap X_{2} \cdots \cap X_{p})$ then we say that the observation $\underline{X}$ is assigned to class $1$ else it is assigned to class $2$. This concept can be generalised to more classes. 
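The product form and the MAP rule above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a full classifier: the function name `naive_bayes_posterior` and the likelihood and prior values are made up for the example, and the per-feature likelihoods are assumed to have been estimated already.

```python
import numpy as np

def naive_bayes_posterior(likelihoods, priors):
    """Posterior over classes under the naive independence assumption.

    likelihoods[k, i] holds P(X_i = x_i | K = k) for one observation,
    priors[k] holds P(K = k). Both are hypothetical inputs for this sketch.
    """
    likelihoods = np.asarray(likelihoods, dtype=float)
    priors = np.asarray(priors, dtype=float)
    # Numerator of Bayes' theorem: product over features times the prior
    numerators = likelihoods.prod(axis=1) * priors
    # Evidence P(X): sum of the numerators over all classes
    evidence = numerators.sum()
    return numerators / evidence

# Two classes, three features (illustrative values only)
example_likelihoods = [[0.8, 0.5, 0.9],
                       [0.2, 0.4, 0.3]]
example_priors = [0.5, 0.5]

posterior = naive_bayes_posterior(example_likelihoods, example_priors)
predicted_class = int(np.argmax(posterior))  # MAP: index of the largest posterior
print(posterior, predicted_class)
```

Because the evidence is shared by all classes, taking `np.argmax` of the numerators alone would select the same class.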

We just want to make something clear here.
- An observation i.e.: something that actually happened, is denoted
$$\underline{x}_{i} = (x_{i,1}, x_{i,2}, \cdots, x_{i,p})$$
- But a set of random variables, which captures all possible things that can happen for all observations, is denoted
$$\underline{X} = (X_{1}, X_{2}, \cdots, X_{p})$$
- Using this notation, $X_{i}, i = 1, \cdots, p$ denotes a feature.

### A word on Scikit learn

`sklearn` provides several implementations of the Naive Bayes method. Three commonly used ones are:

   a) **Gaussian:** Used for classification with continuous features. It assumes that, within each class, each feature follows a normal distribution.

   b) **Multinomial:** Used for discrete counts. In a text classification problem, for example, each feature is not just “word occurs in the document” but “how often the word occurs in the document”. You can think of it as “the number of times outcome $x_i$ is observed over $n$ trials”.

   c) **Bernoulli:** Useful if your feature vectors are binary (i.e. zeros and ones). One application is text classification with a ‘bag of words’ model, where 1 and 0 mean “word occurs in the document” and “word does not occur in the document” respectively.

These implementations differ only in the choice of probability distribution $P$ used for the likelihood. We choose an implementation based on the nature of the features (i.e. predictor variables) in our data.

In [None]:
# Step 1: import the libraries that we will need
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, log_loss

import warnings
warnings.filterwarnings("ignore")

# Step 2: Make use of the Breast Cancer Dataset.
# Load the breast cancer data
X, y = load_breast_cancer(return_X_y=True)
X.shape
y.shape

# Step 3: Fit the model
# Get training and testing data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the model 
naive_bayes = GaussianNB()
# Fit the model 
naive_bayes.fit(X_train, y_train)

##### Assess model performance

So far, we've been using the classification report and confusion matrices to assess classification model performance. However, such metrics don't do a good job at highlighting how confident our model is in its predictions. 

Enter the [log loss](http://wiki.fast.ai/index.php/Log_Loss) function which, unlike other metrics, penalises predictions according to how confident the model is in them. For example, if our model predicts the wrong class with high probability, the log loss penalises it more (i.e. assigns a higher log loss) than a model that predicts the wrong class with low probability. As such, we generally feed class probabilities into the log loss function instead of the actual class predictions (i.e. thresholded probabilities). 

For the log loss metric, lower is better: a perfect model would have a log loss of 0.
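To make the penalty concrete, here is a small sketch of the binary log loss written with the standard library only. The helper name `binary_log_loss` and the probabilities are illustrative assumptions; it is a simplified stand-in for intuition, not a reimplementation of `sklearn.metrics.log_loss`.

```python
import math

def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood of the true labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        # Clip so that log(0) never occurs
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# The true label is 1 in both cases, and both predictions are wrong at a 0.5 threshold
confident_wrong = binary_log_loss([1], [0.01])  # model is very sure of class 0
hesitant_wrong = binary_log_loss([1], [0.40])   # model only leans towards class 0
print(f"confident and wrong: {confident_wrong:.3f}")
print(f"hesitant and wrong:  {hesitant_wrong:.3f}")
```

The confident mistake receives a much larger loss than the hesitant one, which is exactly the behaviour described above.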

In [None]:
# Get predictions on the test set 
y_hat = naive_bayes.predict_proba(X_test)
# Calculate the log loss (this was imported from sklearn above somewhere) 
print("The log loss error for our model is: ", log_loss(y_test, y_hat))

##### Plot the decision boundary 
It is recommended to visualise the decision boundary of our classifier where possible. Unfortunately, in this case, we have around 31 different variables (features + response) and we can visualise only 3 at a time. So let's do just that. 

Below, we create a visualisation which can represent the decision boundary between any two features in $X$. The $y$ (i.e. class label) is indicated by the color of each data point. The decision boundary is the line separating the blue and red regions: any point falling into the red region is assigned the red label (class 1, benign), and any point falling into the blue region is assigned the blue label (class 0, malignant).

To change which features you want to compare, simply change the values for `i` and `j`.

In [None]:
i = 0; j = 1
naive_bayes.fit(X[:, [i, j]], y)
fig = plt.figure(figsize=(8, 8))
ax1 = fig.add_subplot(111)
 
x_min, x_max = X[:, i].min(), X[:, i].max()
y_min, y_max = X[:, j].min(), X[:, j].max()
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 1000), np.linspace(y_min, y_max, 1000))

y_hat = naive_bayes.predict(np.concatenate((xx.reshape(-1,1), yy.reshape(-1,1)), axis=1))
y_hat = y_hat.reshape(xx.shape)

# Pass the colormap by name: plt.cm.get_cmap is deprecated and removed in recent Matplotlib releases
ax1.pcolormesh(xx, yy, y_hat, cmap='RdBu_r')
ax1.scatter(X[:, i], X[:, j], c=y, edgecolors='k', cmap='RdBu_r')
ax1.set_xlabel('Feature '+str(i))
ax1.set_ylabel('Feature '+str(j))
ax1.set_xlim(xx.min(), xx.max())
ax1.set_ylim(yy.min(), yy.max())
ax1.set_xticks(())
ax1.set_yticks(())
plt.show()

### Conditional Probability Using Bayes' Theorem
Bayes' Theorem is used to update probabilities based on new evidence.

Formula for Bayes' Theorem:
$$ P(A|B) = \frac{P(B|A) \times P(A)}{P(B)} $$
- where:
    - P(A∣B) = Probability of event A occurring given that B has occurred (Posterior Probability)
    - P(B∣A) = Probability of event B occurring given that A has occurred (Likelihood)
    - P(A) = Prior probability of event A occurring (Prior Probability)
    - P(B) = Total probability of event B occurring (Evidence / Normalization Factor)

##### Example 1: Disease Diagnosis
- Event A: A person has a disease
- Event B: The person tests positive

Given:
- P(A) = 0.01 (1% of the population has the disease)
- P(B | A) = 0.95 (95% test positive if they have the disease)
- P(B | Not A) = 0.05 (5% false positive rate if they don’t have the disease)

We calculate P(B) using the **Law of Total Probability**:

$$ P(B) = P(B|A) \times P(A) + P(B| \neg A) \times P(\neg A) $$
$$ P(B) = (0.95 \times 0.01) + (0.05 \times 0.99) = 0.059 $$

Now, apply Bayes' Theorem to find P(A | B):
$$ P(A|B) = \frac{P(B|A) \times P(A)}{P(B)} $$
$$ P(A|B) = \frac{ 0.95 \times 0.01}{0.059} \approx 0.161 $$

A positive test result raises the probability of having the disease from 1% to only about 16.1%. This is far lower than the test's 95% sensitivity might suggest, because the disease is rare and false positives (0.0495) outnumber true positives (0.0095).

In [1]:
# Given probabilities
P_A = 0.01   # Prior: Probability of having the disease
P_B_given_A = 0.95  # Likelihood: Probability of testing positive if diseased
P_B_given_not_A = 0.05  # False Positive Rate: Probability of testing positive without disease
P_not_A = 1 - P_A  # Probability of not having the disease

# Total probability of testing positive
P_B = (P_B_given_A * P_A) + (P_B_given_not_A * P_not_A)

# Bayes' Theorem: Probability of having the disease given a positive test
P_A_given_B = (P_B_given_A * P_A) / P_B

print(f"P(A | B) = {P_A_given_B:.4f}")  # Final posterior probability

P(A | B) = 0.1610


##### Example: Calculating Conditional Probabilities
Problem Statement: Suppose we have the following data about a disease and a test for it:

Prevalence of the disease (P(D)): 1% of the population has the disease.

Test accuracy:
- If a person has the disease, the test is positive 99% of the time : P(T+∣D)=0.99
- If a person does not have the disease, the test is positive 5% of the time: P(T+∣¬D)=0.05

We want to calculate:
- The probability that a person has the disease given that the test is positive: P(D∣T+)
- The probability that a person does not have the disease given that the test is positive: P(¬D∣T+)

**Step 1: Define the Probabilities**
- P(D)=0.01 (1% of the population has the disease).
- P(¬D)= 1−P(D)= 0.99.
- P(T+∣D)= 0.99.
- P(T+∣¬D)= 0.05.

**Step 2: Compute P(T+)**

The **marginal probability** of a positive test result (P(T+)) is:
$$ P(T+) = P(T+|D) \cdot P(D) + P(T+|\neg D) \cdot P(\neg D) $$

Substitute the values:
$$ P(T+) = (0.99) \cdot (0.01) + (0.05) \cdot (0.99) $$
$$ P(T+) = 0.0099 + 0.0495 $$
$$ P(T+) = 0.0594 $$

**Step 3: Apply Bayes' Theorem**

Probability of having the disease given a positive test (P(D∣T+)):
$$ P(D|T+) = \frac{P(T+ |D) \cdot P(D)}{P(T+)} $$
$$ P(D|T+) = \frac{0.99 \cdot 0.01}{0.0594} $$
$$ P(D|T+) \approx 0.1667 $$

Probability of not having the disease given a positive test (P(¬D∣T+)):
$$ P(\neg D|T+) = \frac{P(T+ |\neg D) \cdot P(\neg D)}{P(T+)} $$
$$ P(\neg D|T+) = \frac{0.05 \cdot 0.99}{0.0594} $$
$$ P(\neg D|T+) \approx 0.8333 $$

Interpretation
- Even if the test is positive, there is only a 16.67% chance that the person actually has the disease.

This is because the disease is rare in the population (low prior probability), and the test has a significant false positive rate.

In [None]:
# Define the probabilities
P_D = 0.01  # P(D)
P_not_D = 1 - P_D  # P(¬D)
P_Tplus_given_D = 0.99  # P(T+ | D)
P_Tplus_given_not_D = 0.05  # P(T+ | ¬D)

# Compute P(T+)
P_Tplus = (P_Tplus_given_D * P_D) + (P_Tplus_given_not_D * P_not_D)

# Apply Bayes' Theorem
P_D_given_Tplus = (P_Tplus_given_D * P_D) / P_Tplus
P_not_D_given_Tplus = (P_Tplus_given_not_D * P_not_D) / P_Tplus

# Print results
print(f"P(D | T+): {P_D_given_Tplus:.4f}")
print(f"P(¬D | T+): {P_not_D_given_Tplus:.4f}")

## Simple Bayes Classifier | Naive Bayes Classifier

A simple Bayes classifier (often referred to as a Naive Bayes classifier) is a probabilistic machine learning model used for classification tasks, particularly for two-class problems. It is based on Bayes' Theorem and makes strong **independence assumptions between features**.
- The term "naive" refers to the assumption that all features are conditionally independent given the class label. Despite this simplifying assumption, Naive Bayes often performs well in practice.

Below are the salient features of a simple Bayes classifier, including the Bayes decision boundary and Bayes error rate:

### Bayes' Theorem
Bayes' Theorem is a fundamental concept in probability theory that describes how to update the probabilities of hypotheses when given evidence. It is widely used in machine learning, statistics, and data science for tasks like classification, inference, and decision-making.

The classifier is based on Bayes' Theorem, which calculates the posterior probability of a class given the input features. For two classes $C_1$ and $C_2$,the posterior probability for class $C_i$ given a feature vector $x = (x_1, x_2, ..., x_n)$ is:

$$ P(C_i|x) = \frac{P(x|C_i) \times P(C_i)}{P(x)} $$
- where: 
    - $P(x|C_i)$: Likelihood of observing x given class $C_i$
    - $P(C_i)$: Prior probability of class $C_i$
    - $P(x)$: Marginal probability of x (acts as a normalizing constant).

### Naive Assumption
The classifier assumes that the features $x_1, x_2, ..., x_n$ are **conditionally independent** given the class. This simplifies the **likelihood calculation**:
$$ P(x|C_i) = \prod^n_{j = 1} P(x_j | C_i)$$

This assumption is called "naive" because it ignores potential dependencies between features.

### Decision Rule
The classifier assigns the input x to the class with the highest posterior probability. For two classes, the decision rule is:
$$ \text{Assign x to } C_1 \text{ if } P(C_1|x) > P(C_2|x), \text{ else assign to } C_2 $$

This can also be expressed in terms of the logarithm of the posterior probabilities to simplify computations:
$$ \text{Assign x to } C_1 \text{ if } \log P(C_1|x) > \log P(C_2|x), \text{ else assign to } C_2 $$
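One practical reason for working with logarithms is numerical: a product of many small likelihoods can underflow to zero, whereas a sum of logs stays representable. The sketch below illustrates this with made-up likelihood values; `log_joint` is a hypothetical helper, and it compares the log of the numerators $P(x|C_i) P(C_i)$, which gives the same ordering as the log posteriors because $P(x)$ is shared by both classes.

```python
import math

def log_joint(feature_likelihoods, prior):
    """log P(C) + sum_j log P(x_j | C), the log of the Bayes numerator."""
    return math.log(prior) + sum(math.log(p) for p in feature_likelihoods)

# 100 features with tiny (illustrative) likelihoods per class
likelihoods_c1 = [1e-5] * 100
likelihoods_c2 = [2e-5] * 100

# Direct products underflow to 0.0 in double precision, so they cannot be compared
product_c1 = math.prod(likelihoods_c1)
product_c2 = math.prod(likelihoods_c2)

# Log scores remain finite and can be compared
log_c1 = log_joint(likelihoods_c1, prior=0.5)
log_c2 = log_joint(likelihoods_c2, prior=0.5)

predicted = "C1" if log_c1 > log_c2 else "C2"
print(product_c1, product_c2, predicted)
```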

### Bayes Decision Boundary
The Bayes decision boundary is the set of points in the feature space where the posterior probabilities of the two classes are equal:
$$ P(C_1 | x) = P(C_2| x)$$

This boundary separates the feature space into regions corresponding to each class. For a simple Bayes classifier, the decision boundary is often linear or quadratic, depending on the distribution of the features.

### Bayes Error Rate
The Bayes error rate is the minimum possible error rate for a given classification problem, achieved by the Bayes classifier. It represents the inherent uncertainty in the data due to overlapping class distributions. Mathematically, it is defined as:

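$$ \text{Bayes Error Rate} = 1 - E_x \left[ \max_{i} P(C_i|x) \right] $$
- where:
    - $E_x$: Expectation over the feature space.
    - $\max_i P(C_i|x)$: Largest posterior probability among the classes at $x$, i.e. the probability that the Bayes classifier's prediction at $x$ is correct.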

The Bayes error rate is a theoretical lower bound on the classification error and cannot be reduced further, even with a perfect model.

### **Conditions for Applying Naive Bayes**
Conditional Independence:
- The Naive Bayes classifier assumes that features are conditionally independent given the class label. This assumption may not hold in real-world data, but the classifier often performs well even when the assumption is violated.

Discrete or Continuous Features:
- Naive Bayes can handle both discrete (categorical) and continuous features.
    - For discrete features, the likelihood is computed using frequency counts.
    - For continuous features, the likelihood is typically modeled using a probability distribution (e.g., Gaussian).
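The two ways of estimating a per-feature likelihood can be sketched as follows. This is a simplified illustration using only the standard library: the helper names and the sample values are hypothetical, no smoothing is applied, and the Gaussian variance is the maximum-likelihood estimate (dividing by $n$).

```python
import math
from collections import Counter

def categorical_likelihood(values_in_class, query):
    """P(x_j = query | C) estimated by relative frequency within the class."""
    counts = Counter(values_in_class)
    return counts[query] / len(values_in_class)

def gaussian_likelihood(values_in_class, query):
    """Density of `query` under a normal distribution fitted to the class."""
    n = len(values_in_class)
    mean = sum(values_in_class) / n
    variance = sum((v - mean) ** 2 for v in values_in_class) / n
    return math.exp(-(query - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Discrete feature: a binary symptom observed for three patients of one class
symptom_in_class = [1, 0, 1]
p_symptom = categorical_likelihood(symptom_in_class, 1)

# Continuous feature: test scores for three patients of the same class
scores_in_class = [9.0, 10.0, 11.0]
density_at_mean = gaussian_likelihood(scores_in_class, 10.0)

print(p_symptom, density_at_mean)
```

Note that the Gaussian case returns a density rather than a probability, so its value can exceed 1 for small variances. This is fine for classification because only the comparison between classes matters.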

Small to Medium-Sized Datasets:
- Naive Bayes is computationally efficient and works well with small to medium-sized datasets.

High-Dimensional Data:
- Naive Bayes performs well with high-dimensional data (e.g., text data with many features) because the independence assumption reduces the complexity of the model.

Balanced or Imbalanced Classes:
- Naive Bayes can handle both balanced and imbalanced class distributions, as it relies on probabilities rather than decision boundaries.

### **Calculate the Bayes Error Rate** 
To calculate the Bayes error rate, we need an example where we know the true class-conditional distributions of the data. Let’s consider a simple binary classification problem with two classes, $C_1$ and $C_2$, and a single feature x. We’ll assume the feature x follows known probability distributions for each class.

Example: Medical Diagnosis
- Suppose we are building a classifier to diagnose a disease ($C_1$:Disease , $C_2$:No Disease) based on a single test result x (e.g., a blood test score). The distributions of x for the two classes are:
    - $P(x|C_1)$: Normal distribution with mean $\mu$ = 10 and variance $\sigma^2$ = 4.
    - $P(x|C_2)$: Normal distribution with mean $\mu$ = 6 and variance $\sigma^2$ = 4.

Assume the prior probabilities are:
- $P(C_1) = 0.3$ (30% of the population has the disease).
- $P(C_2) = 0.7$ (70% of the population does not have the disease).

**Step 1: Class-Conditional Distributions**
The probability density functions (PDFs) for the two classes are:
$$ P(x|C_1) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu_1)^2}{2\sigma^2}} = \frac{1}{\sqrt{8\pi}} e^{-\frac{(x-10)^2}{8}}$$
$$ P(x|C_2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu_2)^2}{2\sigma^2}} = \frac{1}{\sqrt{8\pi}} e^{-\frac{(x-6)^2}{8}}$$

**Step 2: Posterior Probabilities**
Using Bayes' Theorem, the posterior probabilities are:
$$ P(C_1|x) = \frac{P(x|C_1) \times P(C_1)}{P(x)} $$
$$ P(C_2|x) = \frac{P(x|C_2) \times P(C_2)}{P(x)} $$
- Where:
    - $P(x) = P(x|C_1) \cdot P(C_1) + P(x|C_2) \cdot P(C_2)$

**Step 3: Bayes Decision Boundary**
The Bayes decision boundary occurs where the posterior probabilities are equal:
$$ P(C_1|x) = P(C_2|x)$$

Substituting the likelihoods and priors:
$$ P(x|C_1) \cdot P(C_1) = P(x|C_2) \cdot P(C_2)$$

Taking the natural logarithm of both sides:
$$ -\frac{(x-10)^2}{8} + \ln (0.3) = -\frac{(x-6)^2}{8} + \ln (0.7) $$

Solve for x:
$$ (x - 6)^2 - (x - 10)^2 = 8 \ln (\frac{0.7}{0.3}) $$
$$ x^2 - 12x + 36 - (x^2 - 20x + 100) = 8 \ln (\frac{7}{3}) $$
$$ 8x - 64 = 8 \ln (\frac{7}{3}) $$
$$ x = 8 + \ln (\frac{7}{3}) \approx 8 + 0.847 = 8.847 $$

So, the Bayes decision boundary is at x = 8.847

**Step 4: Bayes Error Rate**
The Bayes Error Rate is the probability of misclassification when using the Bayes decision rule. It is calculated as:

$$ \text{Bayes Error Rate} = \int_{-\infty}^{\infty} \min [P(C_1|x), P(C_2|x)] \cdot P(x) \, dx $$
In this example, the error occurs in the regions where:
- $ x < 8.847$ but the true class is $C_1$.
- $ x \geq 8.847$ but the true class is $C_2$

The Bayes Error Rate can be computed as:
$$ \text{Bayes Error Rate} = P(C_1) \cdot P(x < 8.847|C_1) + P(C_2) \cdot P(x \geq 8.847| C_2)$$

Using the cumulative distribution function (CDF) of the normal distribution:
$$ P(x < 8.847|C_1) = \Phi (\frac{8.847 - 10}{2}) = \Phi (-0.5765) \approx 0.282 $$
$$ P(x \geq 8.847| C_2) = 1 - \Phi (\frac{8.847 - 6}{2}) = 1- \Phi (1.4235) \approx 1 - 0.923  = 0.077$$
Thus:
$$ \text{Bayes Error Rate} = 0.3 \cdot 0.282 + 0.7 \cdot 0.077 \approx 0.0846 + 0.0539 = 0.1385 $$

The Bayes Error Rate is approximately 13.85%.
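The hand calculation above can be cross-checked with the standard library alone. The sketch below assumes the same parameters as the example (means 10 and 6, common standard deviation 2, priors 0.3 and 0.7). The closed-form boundary used here holds only when both classes share the same variance, and `normal_cdf` is a helper defined for this sketch.

```python
import math

def normal_cdf(z):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mu1, mu2, sigma = 10.0, 6.0, 2.0   # class-conditional means and shared std. dev.
p_c1, p_c2 = 0.3, 0.7              # priors for Disease (C1) and No Disease (C2)

# Equal-variance boundary: midpoint of the means, shifted by the log prior ratio
boundary = (mu1 + mu2) / 2 + sigma**2 / (mu1 - mu2) * math.log(p_c2 / p_c1)

# Errors: C1 cases falling below the boundary plus C2 cases falling above it
miss_c1 = normal_cdf((boundary - mu1) / sigma)
miss_c2 = 1.0 - normal_cdf((boundary - mu2) / sigma)
bayes_error = p_c1 * miss_c1 + p_c2 * miss_c2

print(f"Decision boundary: {boundary:.3f}")
print(f"Bayes error rate:  {bayes_error:.4f}")
```

Small differences in the last digit relative to the hand calculation come from rounding the CDF values to three decimals there.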
_____________________

Interpretation of Outcomes

Bayes Error Rate (13.85%):
- This is the minimum possible error rate for this classification problem, given the overlapping distributions of the two classes.
- It represents the inherent uncertainty in the data due to the overlap between the class-conditional distributions.

Decision Boundary (x = 8.847):
- This is the optimal threshold for classifying a test result x into one of the two classes.

Implications:
- Even with a perfect model, the classifier will make errors 13.85% of the time due to the overlap in the distributions of the two classes.
- Reducing the Bayes Error Rate would require better features or less overlap between the class distributions.

##### **Key Characteristics**
- Simplicity: The naive Bayes classifier is easy to implement and computationally efficient.
- Scalability: It works well with high-dimensional data due to the independence assumption.
- Robustness: It performs well even with limited training data.
- Limitations: The independence assumption may not hold in real-world data, leading to suboptimal performance in some cases.

##### Differences Between Naive Bayes and Bayesian Classifier
|Aspect |Naive Bayes Classifier|	Bayesian Classifier  |
|-------|----------------------|-------------------------|
|Feature Independence |	Assumes all features are conditionally independent. |	Does not assume feature independence. |
|Complexity	 |Simple and computationally efficient.|	Can be more complex, depending on the model. |
|Likelihood Calculation	| Uses the product of individual feature likelihoods. |	May use joint distributions or more complex models.|
|Use Cases	| Text classification, spam filtering, etc. |	General probabilistic modeling and inference.  |
|Scalability	|Scales well with high-dimensional data.	|May struggle with high-dimensional data. |

In [None]:
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Define the parameters
mu1, sigma1 = 10, np.sqrt(4)  # Disease (C1)
mu2, sigma2 = 6, np.sqrt(4)   # No Disease (C2)
P_C1 = 0.3  # Prior probability of Disease
P_C2 = 0.7  # Prior probability of No Disease

# Define the class-conditional PDFs
def pdf_C1(x):
    return norm.pdf(x, loc=mu1, scale=sigma1)

def pdf_C2(x):
    return norm.pdf(x, loc=mu2, scale=sigma2)

# Define the marginal PDF P(x)
def marginal_pdf(x):
    return pdf_C1(x) * P_C1 + pdf_C2(x) * P_C2

# Define the posterior probabilities
def posterior_C1(x):
    return (pdf_C1(x) * P_C1) / marginal_pdf(x)

def posterior_C2(x):
    return (pdf_C2(x) * P_C2) / marginal_pdf(x)

# Find the Bayes decision boundary (where P(C1 | x) = P(C2 | x))
from scipy.optimize import fsolve

def decision_boundary_equation(x):
    return posterior_C1(x) - posterior_C2(x)

x_boundary = fsolve(decision_boundary_equation, x0=8)[0]  # Initial guess x0=8
print(f"Bayes Decision Boundary: x = {x_boundary:.4f}")

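# Define the integrand for the Bayes Error Rate.
# min[P(C1|x), P(C2|x)] * P(x) equals min[P(x|C1) P(C1), P(x|C2) P(C2)];
# the unnormalised form avoids a 0/0 division far in the tails, where both densities underflow.
def integrand(x):
    return min(pdf_C1(x) * P_C1, pdf_C2(x) * P_C2)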

# Calculate the Bayes Error Rate using numerical integration
bayes_error_rate, _ = quad(integrand, -np.inf, np.inf)
print(f"Bayes Error Rate: {bayes_error_rate:.4f}")

In [None]:
import numpy as np
from scipy.stats import norm

# Step 1: Define the distributions
mu1, sigma1 = 10, np.sqrt(4)  # Disease (C1)
mu2, sigma2 = 6, np.sqrt(4)   # No Disease (C2)

# Step 2: Define the prior probabilities
P_C1 = 0.3  # P(C1)
P_C2 = 0.7  # P(C2)

# Step 3: Function to compute posterior probabilities
def posterior_probability(x, mu, sigma, prior):
    """
    Compute the unnormalized posterior P(x | C) * P(C) for a given class.
    The caller divides by P(x) to obtain the actual posterior.
    """
    likelihood = norm.pdf(x, loc=mu, scale=sigma)  # P(x | C)
    return likelihood * prior  # P(x | C) * P(C)

# Step 4: Function to classify a new test score
def classify(x):
    """
    Classify a test score x as Disease (C1) or No Disease (C2).
    """
    # Compute unnormalized posterior probabilities
    P_C1_given_x = posterior_probability(x, mu1, sigma1, P_C1)
    P_C2_given_x = posterior_probability(x, mu2, sigma2, P_C2)
    
    # Normalize the probabilities
    P_x = P_C1_given_x + P_C2_given_x  # Marginal probability P(x)
    P_C1_given_x /= P_x
    P_C2_given_x /= P_x
    
    # Classify based on the higher posterior probability
    if P_C1_given_x > P_C2_given_x:
        return "Disease (C1)", P_C1_given_x, P_C2_given_x
    else:
        return "No Disease (C2)", P_C1_given_x, P_C2_given_x

# Step 5: Test the classifier with a new test score
x_new = 8  # New test score
classification, P_C1_given_x, P_C2_given_x = classify(x_new)
print(f"Classification: {classification}")
print(f"Posterior Probability of Disease (C1): {P_C1_given_x:.4f}")
print(f"Posterior Probability of No Disease (C2): {P_C2_given_x:.4f}")

## Simple Bayesian classifier

A Bayesian classifier  is a probabilistic machine learning model that uses Bayes' Theorem to predict the class of a given input based on the observed features.

Below are the salient features of a simple Bayesian classifier, followed by a worked spam-detection example:

### Bayes' Theorem
The classifier is based on Bayes' Theorem, which calculates the posterior probability of a class given the input features. For two classes $C_1$ and $C_2$,the posterior probability for class $C_i$ given a feature vector $x = (x_1, x_2, ..., x_n)$ is:

$$ P(C_i|x) = \frac{P(x|C_i) \times P(C_i)}{P(x)} $$
- where: 
    - $P(x|C_i)$: Likelihood of observing x given class $C_i$
    - $P(C_i)$: Prior probability of class $C_i$
    - $P(x)$: Marginal probability of x (acts as a normalizing constant).

### Decision Rule
The classifier assigns the input x to the class with the highest posterior probability.
$$ \text{Assign x to } C_i \text{ if } P(C_i|x) > P(C_j|x) \text{ for all } j \neq i $$

### Naive Assumption
The classifier assumes that the features $x_1, x_2, ..., x_n$ are conditionally independent given the class. This simplifies the **likelihood calculation**:
$$ P(x|C_i) = \prod^n_{j = 1} P(x_j | C_i)$$

This assumption is called "naive" because it ignores potential dependencies between features.

#### Example: Email Spam detection with two features ($x_1$: free, $x_2$: money)

$$P(C_1) = \frac{3}{5} = 0.6 \text{ (probability of spam).} $$
$$P(C_2) = \frac{2}{5} = 0.4 \text{ (probability of not spam).} $$

**Step 1: Compute Likelihoods**

We compute the likelihoods $P(x_j | C_i)$ for each feature and class.
- For $C_1$ (Spam):
    - $P(x_1 = 1 | C_1) = \frac{2}{3} \quad \text{(``free'' appears in 2 out of 3 spam emails)}$
    - $P(x_1 = 0 | C_1) = \frac{1}{3}.$
    - $P(x_2 = 1 | C_1) = \frac{2}{3} \quad \text{(``money'' appears in 2 out of 3 spam emails).}$
    - $P(x_2 = 0 | C_1) = \frac{1}{3}.$
- For $C_2$ (Not Spam):
    - $P(x_1 = 1 | C_2) = \frac{0}{2} = 0 \quad \text{(``free'' does not appear in not spam emails).}$
    - $P(x_1 = 0 | C_2) = \frac{2}{2} = 1.$
    - $P(x_2 = 1 | C_2) = \frac{1}{2} \quad \text{(``money'' appears in 1 out of 2 not spam emails).}$
    - $P(x_2 = 0 | C_2) = \frac{1}{2}.$

**Step 2: Classify a New Email**
- For $C_1$ (Spam):
$$P(C_1 | x) \propto P(x_1 = 1 | C_1) \cdot P(x_2 = 1 | C_1) \cdot P(C_1)$$
$$P(C_1 | x) \propto \frac{2}{3} \times \frac{2}{3} \times 0.6 $$
$$= \frac{4}{9} \times 0.6$$
$$= 0.2667$$

- For $C_2$ (Not Spam):
$$P(C_2 | x) \propto P(x_1 = 1 | C_2) \cdot P(x_2 = 1 | C_2) \cdot P(C_2)$$
$$P(C_2 | x) \propto 0 \times \frac{1}{2} \times 0.4 = 0$$

Since $P(C_1 | x) > P(C_2 | x)$, the email is classified as $\text{spam}$.


In [None]:
import numpy as np

# Step 1: Define the training data
# Features: "free" (x1), "money" (x2)
# Class: Spam (C1), Not Spam (C2)
training_data = np.array([
    [1, 1, 1],  # Email 1: Spam
    [1, 0, 1],  # Email 2: Spam
    [0, 1, 1],  # Email 3: Spam
    [0, 0, 0],  # Email 4: Not Spam
    [0, 1, 0]   # Email 5: Not Spam
])

# Separate features and labels
X = training_data[:, :2]  # Features (x1, x2)
y = training_data[:, 2]    # Labels (1: Spam, 0: Not Spam)

# Step 2: Compute prior probabilities
def compute_priors(y):
    total_samples = len(y)
    p_c1 = np.sum(y == 1) / total_samples  # P(C1)
    p_c2 = np.sum(y == 0) / total_samples  # P(C2)
    return p_c1, p_c2

p_c1, p_c2 = compute_priors(y)
print(f"Prior Probabilities: P(C1) = {p_c1:.2f}, P(C2) = {p_c2:.2f}")

# Step 3: Compute likelihoods
def compute_likelihoods(X, y):
    # Initialize counts for each feature and class
    n_features = X.shape[1]
    likelihoods_c1 = np.zeros((n_features, 2))  # For C1 (Spam)
    likelihoods_c2 = np.zeros((n_features, 2))  # For C2 (Not Spam)
    
    # Count occurrences of each feature value for each class
    for feature in range(n_features):
        for cls in [1, 0]:
            if cls == 1:
                likelihoods_c1[feature, 1] = np.sum((X[:, feature] == 1) & (y == 1)) / np.sum(y == 1)
                likelihoods_c1[feature, 0] = 1 - likelihoods_c1[feature, 1]
            else:
                likelihoods_c2[feature, 1] = np.sum((X[:, feature] == 1) & (y == 0)) / np.sum(y == 0)
                likelihoods_c2[feature, 0] = 1 - likelihoods_c2[feature, 1]
    
    return likelihoods_c1, likelihoods_c2

likelihoods_c1, likelihoods_c2 = compute_likelihoods(X, y)
print("Likelihoods for C1 (Spam):")
print(likelihoods_c1)
print("Likelihoods for C2 (Not Spam):")
print(likelihoods_c2)

# Step 4: Classify a new email
def classify_email(new_email, p_c1, p_c2, likelihoods_c1, likelihoods_c2):
    # Compute posterior probabilities
    p_x_given_c1 = np.prod([likelihoods_c1[i, new_email[i]] for i in range(len(new_email))])
    p_x_given_c2 = np.prod([likelihoods_c2[i, new_email[i]] for i in range(len(new_email))])
    
    p_c1_given_x = p_x_given_c1 * p_c1
    p_c2_given_x = p_x_given_c2 * p_c2
    
    # Normalize probabilities (compute the evidence once so both posteriors share the same denominator)
    evidence = p_c1_given_x + p_c2_given_x
    p_c1_given_x /= evidence
    p_c2_given_x /= evidence
    
    # Decision rule
    if p_c1_given_x > p_c2_given_x:
        return "Spam (C1)", p_c1_given_x, p_c2_given_x
    else:
        return "Not Spam (C2)", p_c1_given_x, p_c2_given_x

# New email: "free" (1), "money" (1)
new_email = [1, 1]
classification, p_c1_given_x, p_c2_given_x = classify_email(new_email, p_c1, p_c2, likelihoods_c1, likelihoods_c2)
print(f"\nClassification Result: {classification}")
print(f"Posterior Probabilities: P(C1 | x) = {p_c1_given_x:.4f}, P(C2 | x) = {p_c2_given_x:.4f}")

#### Naive Bayes Classifier to a medical diagnosis problem
Explain why we don’t need to calculate the denominator of Bayes' rule in the Naive Bayes algorithm.

Problem: Medical Diagnosis - Scenario:
- We want to build a classifier to diagnose a disease based on the following features:
    - Fever: Yes (1) or No (0).
    - Cough: Yes (1) or No (0).
    - Fatigue: Yes (1) or No (0).

- The target variable is:
    - Diagnosis: Disease (1) or No Disease (0).

Dataset:
|Fever $(x_1)$	|Cough $(x_2)$	|Fatigue $(x_3)$	|Diagnosis ($y$)  |
|-------|-------|-----------|-----------|
|1	|1	|1	|1  |
|1	|0	|1	|1  |
|0	|1	|1	|1  |
|0	|0	|0	|0  |
|1	|0	|0	|0  |
|0	|1	|0	|0  |


**Step 1: Train the Naive Bayes Classifier**

The Naive Bayes Classifier calculates the posterior probability of each class given the features and assigns the class with the highest probability. The formula for the posterior probability is:

$$ P(C_i | x ) = \frac{P(x | C_i) \cdot P(C_i)}{P(x)} $$

However, in Naive Bayes, we don’t need to calculate the denominator $P(x)$ because:
- $P(x)$ is the same for all classes $C_i$.
- We are only interested in comparing the relative probabilities of the classes, not their absolute values.
- Therefore, we can ignore $P(x)$ and focus on the numerator:

$$ P(C_i | x) \propto P(x|C_i) \cdot P(C_i) $$

**Step 2: Compute Prior Probabilities**

The prior probability $P(C_i)$ is the probability of each class in the training data.
- $P(\text{Disease}) = \frac{3}{6} = 0.5$
- $P(\text{No Disease}) = \frac{3}{6} = 0.5$

**Step 3: Compute Likelihoods**

The likelihood $P(x∣C_i)$ is the probability of observing the features given the class. Under the Naive Bayes assumption, the features are conditionally independent given the class, so:

$$ P(x| C_i) = P(x_1 | C_i) \cdot P(x_2 | C_i) \cdot P(x_3 | C_i) $$

Likelihoods for Disease ($C_1$):
- $P(x_1 = 1 | C_1) = \frac{2}{3}$
- $P(x_2 = 1 | C_1) = \frac{2}{3}$
- $P(x_3 = 1 | C_1) = \frac{3}{3} = 1$

Likelihoods for No Disease ($C_2$):
- $P(x_1 = 1 | C_2) = \frac{1}{3}$
- $P(x_2 = 1 | C_2) = \frac{1}{3}$
- $P(x_3 = 1 | C_2) = \frac{0}{3} = 0$

**Step 4: Classify a New Sample**

Suppose we have a new patient with the following symptoms:
- Fever: Yes (1).
- Cough: Yes (1).
- Fatigue: Yes (1).

We want to classify this patient as having the disease or not.
- Compute $P(C_1 | x)$, i.e. $P(\text{Disease} | x)$:
$$ P(C_1 | x) \propto P(x| C_1) \cdot P(C_1)$$
$$ P(x| C_1) =  P(x_1 = 1| C_1) \cdot P(x_2 = 1| C_1) \cdot P(x_3 = 1| C_1) $$
$$ P(x| C_1) = \frac{2}{3} \cdot \frac{2}{3} \cdot 1 $$
$$ P(x| C_1) = \frac{4}{9} $$
$$ P(C_1 | x) \propto \frac{4}{9} \cdot 0.5 = \frac{2}{9} $$

- Compute $P(C_2 | x)$, i.e. $P(\text{No Disease} | x)$:
$$ P(C_2 | x) \propto P(x| C_2) \cdot P(C_2)$$
$$ P(x| C_2) =  P(x_1 = 1| C_2) \cdot P(x_2 = 1| C_2) \cdot P(x_3 = 1| C_2) $$
$$ P(x| C_2) = \frac{1}{3} \cdot \frac{1}{3} \cdot 0 $$
$$ P(x| C_2) = 0 $$
$$ P(C_2 | x) \propto 0 \cdot 0.5 = 0 $$

Decision:
- $P(C_1|x) \propto \frac{2}{9}$
- $P(C_2|x) \propto 0$
   - Since $\frac{2}{9} > 0$, the patient is classified as having the disease.
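
The hand calculation above can be reproduced with a short sketch. It estimates the priors and per-feature likelihoods by raw counting (no smoothing), so it should match the fractions in Steps 2 to 4. The helper name `unnormalized_posteriors` is introduced here purely for illustration.

```python
import numpy as np

# Training data from the table above: columns are Fever, Cough, Fatigue.
X = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1: Disease, 0: No Disease


def unnormalized_posteriors(X, y, x_new):
    """Return {class: P(x_new | class) * P(class)} from raw counts (no smoothing)."""
    scores = {}
    for cls in np.unique(y):
        X_cls = X[y == cls]
        prior = len(X_cls) / len(X)
        # P(x_j = 1 | class) for every binary feature j
        p_one = X_cls.mean(axis=0)
        # Use P(x_j = 1 | class) or P(x_j = 0 | class), whichever matches x_new
        per_feature = np.where(x_new == 1, p_one, 1 - p_one)
        scores[int(cls)] = float(prior * per_feature.prod())
    return scores


x_new = np.array([1, 1, 1])
scores = unnormalized_posteriors(X, y, x_new)

# Dividing by the sum (which equals P(x)) rescales every class by the same factor
total = sum(scores.values())
normalized = {c: s / total for c, s in scores.items()}

print("Unnormalized:", scores)
print("Normalized:  ", normalized)
print("Predicted class:", max(scores, key=scores.get))
```

Both dictionaries rank the classes identically, which is why the denominator can be skipped when only the predicted label is needed.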

### **Why We Don’t Need to Calculate the Denominator**
- In Naive Bayes, the denominator $P(x)$ is the same for all classes $C_i$. Since we are only interested in comparing the relative probabilities of the classes, we can ignore $P(x)$ and focus on the numerator:
$$ P(C_i | x) \propto P(x| C_i) \cdot P(C_i)$$
- This simplification makes the computation more efficient without affecting the final classification decision. The denominator is only needed when calibrated posterior probabilities are required, in which case it can be recovered by summing the numerators over all classes.

In [None]:
from sklearn.naive_bayes import BernoulliNB
import numpy as np

# Dataset
X = np.array([
    [1, 1, 1],  # Disease
    [1, 0, 1],  # Disease
    [0, 1, 1],  # Disease
    [0, 0, 0],  # No Disease
    [1, 0, 0],  # No Disease
    [0, 1, 0]   # No Disease
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1: Disease, 0: No Disease

# Train Naive Bayes classifier
model = BernoulliNB()
model.fit(X, y)

# New patient: Fever = 1, Cough = 1, Fatigue = 1
new_patient = np.array([[1, 1, 1]])
prediction = model.predict(new_patient)
print("Prediction:", "Disease" if prediction[0] == 1 else "No Disease")

##### To extend the Bayesian classifier to handle more features, larger datasets, and continuous features, we need to make a few modifications to the previous implementation. 

Specifically:

Handling More Features:
- The classifier can handle any number of features by generalizing the likelihood calculation for each feature.

Larger Datasets:
- For larger datasets, we can use libraries like scikit-learn to efficiently compute probabilities and handle scalability.

Continuous Features:
- For continuous features, we assume they follow a probability distribution (e.g., Gaussian) and compute the likelihood using the probability density function (PDF).
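
As a minimal sketch of the continuous case, the snippet below fits a Gaussian to one feature within one class and evaluates the density at a new value. The ages are hypothetical values chosen only for illustration, and `gaussian_pdf` is a helper name introduced here, not part of any library.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) evaluated at x."""
    coeff = 1.0 / (std * math.sqrt(2.0 * math.pi))
    exponent = -((x - mean) ** 2) / (2.0 * std ** 2)
    return coeff * math.exp(exponent)


# Hypothetical ages observed for one class (illustrative values only)
ages = [25.0, 30.0, 45.0]
mean = sum(ages) / len(ages)
# Population standard deviation (divides by n, like np.std with default settings)
std = math.sqrt(sum((a - mean) ** 2 for a in ages) / len(ages))

# Class-conditional likelihood of observing age = 28 under this Gaussian
likelihood = gaussian_pdf(28.0, mean, std)
print(f"mean={mean:.2f}, std={std:.2f}, p(age=28 | class)={likelihood:.4f}")
```

Note that this is a density, not a probability, so it can exceed 1 for features with a very small standard deviation; only its relative size across classes matters for classification.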

##### Key Differences from the Previous Implementation
Feature Types:
- The previous implementation assumed binary features.
- This extended version handles both categorical and continuous features.

Probability Distributions:
- For continuous features, we use the Gaussian (Normal) distribution to model the likelihood.

Scalability:
- The extended version is designed to handle larger datasets and more features efficiently.

Example Usage
Dataset
Let’s create a synthetic dataset with:
- 2 continuous features (e.g., "age", "income").
- 1 categorical feature (e.g., "education level").
- Binary classes (e.g., "buy" or "not buy").

In [None]:
import numpy as np
from scipy.stats import norm

def compute_priors(y):
    classes, counts = np.unique(y, return_counts=True)
    priors = {cls: count / len(y) for cls, count in zip(classes, counts)}
    return priors, classes

def is_continuous(feature_values):
    return len(np.unique(feature_values)) > 10  # Arbitrary threshold for continuous feature

def compute_likelihoods(X, y, classes):
    n_features = X.shape[1]
    likelihoods = {}
    
    for feature in range(n_features):
        likelihoods[feature] = {}
        for cls in classes:
            feature_values = X[y == cls, feature]
            if is_continuous(feature_values):
                mean, std = np.mean(feature_values), np.std(feature_values)
                likelihoods[feature][cls] = {"distribution": "gaussian", "mean": mean, "std": std}
            else:
                unique_values, counts = np.unique(feature_values, return_counts=True)
                probabilities = counts / np.sum(counts)
                likelihoods[feature][cls] = {"distribution": "categorical", "values": unique_values, "probabilities": probabilities}
    
    return likelihoods

def fit_bayesian_classifier(X, y):
    priors, classes = compute_priors(y)
    likelihoods = compute_likelihoods(X, y, classes)
    return priors, likelihoods, classes

def predict_bayesian_classifier(X, priors, likelihoods, classes):
    predictions = []
    
    for sample in X:
        posteriors = {}
        for cls in classes:
            posterior = priors[cls]
            for feature in range(len(sample)):
                likelihood_params = likelihoods[feature][cls]
                if likelihood_params["distribution"] == "gaussian":
                    mean, std = likelihood_params["mean"], likelihood_params["std"]
                    likelihood = norm.pdf(sample[feature], loc=mean, scale=std)
                else:
                    value_index = np.where(likelihood_params["values"] == sample[feature])[0]
                    likelihood = likelihood_params["probabilities"][value_index[0]] if len(value_index) > 0 else 0
                posterior *= likelihood
            posteriors[cls] = posterior
        predictions.append(max(posteriors, key=posteriors.get))
    
    return np.array(predictions)


# Synthetic dataset
np.random.seed(42)
X = np.array([
    [25, 50000, 1],  # Age, Income, Education Level
    [30, 60000, 2],
    [35, 70000, 1],
    [40, 80000, 3],
    [45, 90000, 2],
    [50, 100000, 3]
])
y = np.array([0, 0, 1, 1, 0, 1])  # 0: Not Buy, 1: Buy

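# Fit the classifier: returns the learned priors, likelihoods, and class labels
priors, likelihoods, classes = fit_bayesian_classifier(X, y)

# Predict for new samples
new_samples = np.array([
    [28, 55000, 1],  # New sample 1
    [38, 75000, 2]   # New sample 2
])
# Note: is_continuous() requires more than 10 unique values per class, so on this
# six-row dataset every feature is treated as categorical. Values not seen in
# training (e.g. age 28) then get likelihood 0 for both classes, and the
# prediction falls back to the first class. A larger dataset avoids this.
predictions = predict_bayesian_classifier(new_samples, priors, likelihoods, classes)
print("Predictions:", predictions)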

##### Explanation of the Extended Implementation
Handling Continuous Features:
- For continuous features, we assume a Gaussian distribution and compute the likelihood using the probability density function (PDF) of the normal distribution.

Handling Categorical Features:
- For categorical features, we compute the likelihood as the probability of observing the feature value given the class.

Scalability:
- The implementation is designed to handle larger datasets and more features efficiently.

Flexibility:
- The classifier can handle mixed feature types (continuous and categorical) in the same dataset.

##### Key Differences from the Previous Implementation
Feature Types:
- The previous implementation only handled binary features.
- The extended version handles both continuous and categorical features.

Probability Distributions:
- The extended version uses Gaussian distributions for continuous features and categorical distributions for discrete features.

Generalization:
- The extended version is more general and can be applied to a wider range of datasets.

### **Applying a Naive Bayes classifier to a Decision Tree problem**
This involves using the Naive Bayes algorithm to solve a classification task that could also be addressed using a Decision Tree. While Decision Trees and Naive Bayes are fundamentally different algorithms, they can both be applied to the same dataset for comparison or specific use cases.

Below, I'll demonstrate how to apply a Naive Bayes classifier to a dataset typically used for Decision Tree problems, such as the Iris dataset or a synthetic dataset for binary classification.

**Problem: Binary Classification with Synthetic Dataset**

Create a synthetic dataset for binary classification and apply the Naive Bayes classifier to it. This dataset could also be used for a Decision Tree problem.

Step 1: Create a Synthetic Dataset

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a synthetic dataset
X, y = make_classification(
    n_samples=1000,  # 1000 samples
    n_features=4,    # 4 features
    n_classes=2,     # Binary classification
    random_state=42
)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 2: Train a Naive Bayes Classifier

We'll use the Gaussian Naive Bayes classifier, which assumes that the features follow a Gaussian (normal) distribution.

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Initialize the Naive Bayes classifier
model = GaussianNB()

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Interpretation
- Accuracy: The model achieves an accuracy of 93.33% on the test set.
- Precision and Recall: Both precision and recall are high for both classes, indicating that the model performs well in distinguishing between the two classes.

Confusion Matrix:
- True Positives (TP): 139
- True Negatives (TN): 141
- False Positives (FP): 11
- False Negatives (FN): 9

##### **Comparison with Decision Trees**
Key Differences

Model Type:
- Naive Bayes is a probabilistic model based on Bayes' Theorem.
- Decision Trees are non-parametric models that split the data based on feature values.

Assumptions:
- Naive Bayes assumes conditional independence of features given the class label.
- Decision Trees make no such assumptions and can capture complex interactions between features.

Interpretability:
- Naive Bayes is less interpretable but computationally efficient.
- Decision Trees are highly interpretable, as they provide a clear decision path.

Performance:
- Naive Bayes performs well with small datasets and high-dimensional data.
- Decision Trees can handle both small and large datasets but may overfit if not properly regularized.

Step 3: Train a Decision Tree for Comparison

Let's train a Decision Tree classifier on the same dataset for comparison.

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Initialize the Decision Tree classifier
tree_model = DecisionTreeClassifier(random_state=42)

# Train the model
tree_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_tree = tree_model.predict(X_test)

# Evaluate the model
accuracy_tree = accuracy_score(y_test, y_pred_tree)
print(f"Decision Tree Accuracy: {accuracy_tree:.4f}")

# Classification report
print("\nDecision Tree Classification Report:")
print(classification_report(y_test, y_pred_tree))

# Confusion matrix
print("\nDecision Tree Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_tree))

### **When the target variable is continuous, the problem is a regression task rather than a classification task.**
The criterion for determining the optimal feature choice and its threshold is different from classification tasks. The goal is to split the data in a way that minimizes the variance of the target variable within each subset after the split.

##### **Criterion for Regression: Variance Reduction**
In regression tasks, the most common criterion for determining the optimal feature and its threshold is **variance reduction**. 
- The idea is to split the data such that the variance of the target variable within each subset is minimized. This is often measured using the Sum of Squared Errors (SSE) or Mean Squared Error (MSE).

Steps to Determine the Optimal Feature and Threshold
- Calculate the Variance of the Target Variable:
    - Compute the variance of the target variable y for the entire dataset.

- Evaluate All Possible Splits:
    - For each feature, sort its values and consider all possible thresholds to split the data into two subsets.
    - For each candidate split, calculate the variance of the target variable in the left and right subsets.

Compute the Weighted Variance:
- Compute the weighted variance for each split:
$$ \text{Weighted Variance} = \frac{n_{left}}{n} \cdot Var(y_{left}) + \frac{n_{right}}{n} \cdot Var(y_{right}) $$
- where:
    - $n_{left}$ and $n_{right}$ are the number of samples in the left and right subsets, respectively.
    - $n$ is the total number of samples.
    - $Var(y_{left})$ and $Var(y_{right})$ are the variances of the target variable in the left and right subsets.

Choose the Split with the Lowest Weighted Variance:
- The optimal split is the one that results in the lowest weighted variance.

Repeat for All Features:
- Repeat the above steps for all features and choose the feature and threshold that provide the best split.

##### Mathematical Formulation
The variance reduction for a split S is defined as:
$$ \text{Variance Reduction} = Var(y) - \left(\frac{n_{left}}{n} \cdot Var(y_{left}) + \frac{n_{right}}{n} \cdot Var(y_{right})\right) $$

The goal is to maximize the variance reduction, which is equivalent to minimizing the weighted variance.

##### Example: Applying Variance Reduction
Dataset
- Consider a dataset with one feature x and a continuous target variable y:

|x	|y	|
|---|---|
|1	|2	|
|2	|3	|
|3	|4	|
|4	|5	|
|5	|6	|

**Step 1: Calculate the Variance of y**
- Mean of y: $\mu = \frac{2+3+4+5+6}{5} = 4$
- Variance of y: $Var(y) = \frac{(2-4)^2 + (3-4)^2 + (4-4)^2 + (5-4)^2 + (6-4)^2 }{5} = 2$

**Step 2: Evaluate All Possible Splits**
- Possible thresholds for x: 1.5, 2.5, 3.5, 4.5

For each threshold:

Threshold = 1.5:
- Left subset: $x \leq 1.5 → y=[2]$
- Right subset: $x > 1.5 → y=[3,4,5,6]$
- Weighted variance: $\frac{1}{5} \cdot 0 + \frac{4}{5} \cdot 1.25 = 1$

Threshold = 2.5:
- Left subset: $x \leq 2.5 → y=[2, 3]$
- Right subset: $x > 2.5 → y=[4,5,6]$
- Weighted variance: $\frac{2}{5} \cdot 0.25 + \frac{3}{5} \cdot 0.6667 = 0.5$

Threshold = 3.5:
- Left subset: $x \leq 3.5 → y=[2,3,4]$
- Right subset: $x > 3.5 → y=[5,6]$
- Weighted variance: $\frac{3}{5} \cdot 0.6667 + \frac{2}{5} \cdot 0.25 = 0.5$

Threshold = 4.5:
- Left subset: $x \leq 4.5 → y=[2,3,4,5]$
- Right subset: $x > 4.5 → y=[6]$
- Weighted variance: $\frac{4}{5} \cdot 1.25 + \frac{1}{5} \cdot 0 = 1$

**Step 3: Choose the Optimal Split**

The splits with thresholds 2.5 and 3.5 result in the lowest weighted variance (0.5).

Either of these splits is optimal.
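
The four weighted variances above can be checked with a small sketch. It assumes the five-point dataset implied by the calculations (x = 1..5, y = 2..6) and uses midpoints between consecutive feature values as candidate thresholds; the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])

# Candidate thresholds: midpoints between consecutive sorted feature values
x_sorted = np.sort(x)
thresholds = (x_sorted[:-1] + x_sorted[1:]) / 2.0

weighted_variances = {}
for t in thresholds:
    left, right = y[x <= t], y[x > t]
    # Size-weighted average of the population variances of the two subsets
    weighted = (len(left) * np.var(left) + len(right) * np.var(right)) / len(y)
    weighted_variances[float(t)] = float(weighted)

total_variance = float(np.var(y))
for t, w in weighted_variances.items():
    print(f"threshold={t}: weighted variance={w:.4f}, reduction={total_variance - w:.4f}")
```

Using midpoints rather than the observed values themselves gives the same partitions of the data; it only changes where the boundary sits between two neighbouring points.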

##### Explanation of the Code
Variance Calculation:
- The variance function computes the variance of the target variable y.

Finding the Best Split:
- The find_best_split function iterates over all unique values of the feature X as potential thresholds.
- For each threshold, it calculates the weighted variance of the left and right subsets.
- The split with the highest variance reduction is selected as the best split.

Result:
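- The best threshold is 2 (the split $x \leq 2$, which produces the same subsets as the threshold 2.5 above), and the corresponding variance reduction is 1.5000 ($2 - 0.5$). Threshold 3 gives the same reduction, but the code keeps the first split it finds.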

In [None]:
import numpy as np

# Dataset
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 4, 5, 6])

# Function to calculate variance
def variance(y):
    return np.var(y)

# Function to find the best split
def find_best_split(X, y):
    n = len(y)
    best_var_reduction = -np.inf
    best_threshold = None
    
    for threshold in np.unique(X):
        left_indices = X <= threshold
        right_indices = X > threshold
        
        y_left = y[left_indices]
        y_right = y[right_indices]
        
        if len(y_left) == 0 or len(y_right) == 0:
            continue
        
        weighted_var = (len(y_left) / n) * variance(y_left) + (len(y_right) / n) * variance(y_right)
        var_reduction = variance(y) - weighted_var
        
        if var_reduction > best_var_reduction:
            best_var_reduction = var_reduction
            best_threshold = threshold
    
    return best_threshold, best_var_reduction

# Find the best split
best_threshold, best_var_reduction = find_best_split(X, y)
print(f"Best Threshold: {best_threshold}")
print(f"Best Variance Reduction: {best_var_reduction:.4f}")

### **What Is Lift in the Naive Bayes Method?**
Lift is a metric used to evaluate the effectiveness of a predictive model. It measures how much better the model performs compared to a random guess or a baseline model. Lift is particularly useful in scenarios like marketing campaigns, where we want to identify the most promising targets based on predicted probabilities.

##### **What is Lift?**
Lift is defined as the ratio of the target response rate when using the model to the baseline response rate (i.e., the response rate without the model). Mathematically, it is expressed as:
$$ \text{Lift} = \frac{P(\text{Target} \mid \text{Model})}{P(\text{Target})} $$
- Where:
    - $P(\text{Target} \mid \text{Model})$: The probability of the target event among the instances the model selects (for example, the top-scored group).
    - $P(\text{Target})$: The baseline probability of the target event occurring in the population.

Interpretation of Lift
- Lift > 1: The model performs better than random guessing.
- Lift = 1: The model performs no better than random guessing.
- Lift < 1: The model performs worse than random guessing.

##### **Steps to Calculate Lift**
Train the Naive Bayes Model:
- Train the Naive Bayes classifier on the training data.

Predict Probabilities:
- Use the trained model to predict the probabilities of the target class for each instance in the test data.

Sort Instances by Predicted Probability:
- Sort the test instances in descending order of their predicted probabilities.

Divide into Deciles or Groups:
- Divide the sorted instances into groups (e.g., deciles or percentiles).

Calculate Lift for Each Group:
- For each group, calculate the lift as:
$$ \text{Lift} = \frac{\text{Response Rate in Group}}{\text{Overall Response Rate}} $$

##### Example: Calculating Lift in a Marketing Campaign
Problem Statement
- Suppose we are running a marketing campaign to target customers likely to purchase a product. We have a dataset with the following features:
    - Age: Continuous.
    - Income: Continuous.
    - Purchased: Binary (1 = Purchased, 0 = Did Not Purchase).
- We want to calculate the lift of the Naive Bayes model to evaluate its effectiveness.

**Step 1: Train the Naive Bayes Model**
- We train a Naive Bayes classifier on the dataset to predict the probability of a customer purchasing the product.

**Step 2: Predict Probabilities**
- For each customer in the test set, the model predicts the probability of purchasing the product.

**Step 3: Sort Instances by Predicted Probability**
- Sort the test instances in descending order of their predicted probabilities.

**Step 4: Divide into Deciles**
- Divide the sorted instances into 10 equal groups (deciles).

**Step 5: Calculate Lift for Each Decile**
- For each decile, calculate the lift as:

$$ \text{Lift} = \frac{\text{Response Rate in Decile}}{\text{Overall Response Rate}} $$

Dataset
|Customer	|Predicted Probability	|Actual Purchase|
|-----------|-----------------------|---------------|
|1	|0.95	|1    |
|2	|0.90	|1    |
|3	|0.85	|1    |
|4	|0.80	|0    |
|5	|0.75	|1    |
|6	|0.70	|0    |
|7	|0.65	|0    |
|8	|0.60	|0    |
|9	|0.55	|0    |
|10	|0.50	|0    |

Overall Response Rate
- Total purchases: 4 (customers 1, 2, 3, and 5).
- Total customers: 10.
- Overall response rate: $\frac{4}{10}$ = 0.4

Lift for the Top 30%
- With only 10 customers, a single decile would contain one customer, so this example uses the top three customers: Customers 1, 2, 3.
- Purchases in the top three: 3.
- Response rate in the top three: $\frac{3}{3}$ = 1
- Lift: $\frac{1}{0.4} = 2.5$
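
As a quick check, the lift can be computed directly from the table. The Actual Purchase column contains four purchases among ten customers, so the overall response rate used below is 0.4. The helper `lift_at_top` is a hypothetical name introduced for this sketch.

```python
import numpy as np

# Values copied from the table above
predicted = np.array([0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50])
actual = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])


def lift_at_top(actual_sorted, k):
    """Lift of the k highest-scored instances relative to the overall response rate."""
    return actual_sorted[:k].mean() / actual_sorted.mean()


# Sort by predicted probability, highest first (the table is already in this order)
order = np.argsort(-predicted)
actual_sorted = actual[order]

for k in (1, 3, 5):
    print(f"Top {k}: lift = {lift_at_top(actual_sorted, k):.2f}")
```

Lift typically shrinks toward 1 as `k` grows, because the selected group looks more and more like the whole population.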

In [None]:
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Synthetic dataset
np.random.seed(42)
X = np.random.rand(100, 2)  # Features: Age, Income
y = np.random.randint(0, 2, 100)  # Target: Purchased (0 or 1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Naive Bayes classifier
model = GaussianNB()
model.fit(X_train, y_train)

# Predict probabilities for the test set
y_pred_proba = model.predict_proba(X_test)[:, 1]  # Probability of class 1 (Purchase)

# Sort test instances by predicted probability
sorted_indices = np.argsort(y_pred_proba)[::-1]
y_test_sorted = y_test[sorted_indices]

# Divide into deciles
n_deciles = 10
decile_size = len(y_test_sorted) // n_deciles
lift_values = []

for i in range(n_deciles):
    start = i * decile_size
    end = (i + 1) * decile_size
    y_decile = y_test_sorted[start:end]
    response_rate_decile = np.mean(y_decile)
    overall_response_rate = np.mean(y_test)
    lift = response_rate_decile / overall_response_rate
    lift_values.append(lift)

# Print lift values
for i, lift in enumerate(lift_values):
    print(f"Decile {i + 1}: Lift = {lift:.2f}")

### **Generative method and Naive-Naive Bayes**
A generative method is a type of machine learning model that learns the joint probability distribution P(X,Y) of the input features X and the target labels Y. Once the joint distribution is learned, the model can generate new data samples that resemble the training data. Generative models are used for tasks like:
- Data generation (e.g., generating realistic images or text).
- Density estimation (e.g., estimating the probability distribution of data).
- Classification (e.g., predicting the class label for a given input).

Key Characteristics of Generative Models

Joint Probability Distribution:
- Generative models learn $P(X,Y)$, which can be factored as:
$$ P(X,Y) = P(X|Y) \cdot P(Y)$$
- Here, $P(X∣Y)$ is the likelihood of the features given the class, and $P(Y)$ is the prior probability of the class.

Data Generation:
- Once the model learns $P(X,Y)$, it can generate new samples $(x,y)$ by sampling from this distribution.

Examples of Generative Models:
- Naive Bayes.
- Gaussian Mixture Models (GMM).
- Hidden Markov Models (HMM).
- Generative Adversarial Networks (GANs).
- Variational Autoencoders (VAEs).

##### **Naive Bayes as a Generative Model**
Naive Bayes is a classic example of a generative model. It assumes that the features X are conditionally independent given the class Y. The joint probability distribution is modeled as:
$$ P(X,Y) = P(Y) \cdot \prod_{j=1}^n P(X_j|Y)$$
- Where:
    - $P(Y)$: Prior probability of the class.
    - $P(X_j|Y)$: Likelihood of the j-th feature given the class.

Steps in Naive Bayes

Learn the Prior $P(Y)$:
- Estimate the probability of each class from the training data.

Learn the Likelihood $P(X_j∣Y)$:
- For each feature $X_j$, estimate its probability distribution given the class.

Predict the Posterior $P(Y∣X)$:
- Use Bayes' Theorem to compute the posterior probability of the class given the features:
$$ P(Y|X) \propto P(Y) \cdot \prod_{j=1}^n P(X_j|Y)$$

Generate New Samples:
- Sample from the learned joint distribution $P(X,Y)$ to generate new data.
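
The sampling step can be sketched from scratch: draw a class from $P(Y)$, then draw each feature independently from $P(X_j | Y)$. The parameter values below are hypothetical and chosen only for illustration; `sample_joint` is a name introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters for two classes and two binary features
prior = np.array([0.5, 0.5])                     # P(Y = 0), P(Y = 1)
p_feature_on = np.array([[1 / 3, 1 / 3],         # P(X_j = 1 | Y = 0)
                         [2 / 3, 2 / 3]])        # P(X_j = 1 | Y = 1)


def sample_joint(n):
    """Draw n pairs (x, y): first y ~ P(Y), then each x_j ~ P(X_j | Y = y)."""
    ys = rng.choice(len(prior), size=n, p=prior)
    # Each row of p_feature_on[ys] holds the Bernoulli parameters for that sample
    xs = (rng.random((n, p_feature_on.shape[1])) < p_feature_on[ys]).astype(int)
    return xs, ys


xs, ys = sample_joint(10_000)
print("Empirical P(Y=1):", ys.mean())
print("Empirical P(X_1=1 | Y=1):", xs[ys == 1, 0].mean())
```

With enough samples, the empirical frequencies should approach the parameters that were used to generate them, which is a simple sanity check on any generative model.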

##### **Naive-Naive Bayes**
The term Naive-Naive Bayes is not a standard term in machine learning literature. However, it could refer to a hierarchical or ensemble approach where multiple Naive Bayes models are combined or applied in a layered manner. For example:
- Hierarchical Naive Bayes: A two-level Naive Bayes model where the first level predicts a coarse-grained class, and the second level refines the prediction.
- Ensemble Naive Bayes: Combining multiple Naive Bayes models (e.g., using bagging or boosting) to improve performance.

Alternatively, it might refer to a simplified or "naive" version of Naive Bayes where additional assumptions are made to further simplify the model.

#### Example: Naive Bayes as a Generative Model

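Build a Naive Bayes classifier to predict whether an email is spam or not spam based on the presence of two keywords. The training set (six emails: three spam, three not spam) is the one defined in the code cell at the end of this example. The two binary features are: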
- $x_1$: free, 
- $x_2$: money

**Step 1: Learn the Prior $P(Y)$**
- $P(C_1) = \frac{3}{6} = 0.5 \text{ (Spam).} $
- $P(C_2) = \frac{3}{6} = 0.5 \text{ (Not Spam).} $

**Step 2: Compute Likelihoods**

We compute the likelihoods $P(x_j | C_i)$ for each feature and class.
- For $C_1$ (Spam):
    - $P(x_1 = 1 | C_1) = \frac{2}{3}$ ("free" appears in 2 out of 3 spam emails).
    - $P(x_1 = 0 | C_1) = \frac{1}{3}$
    - $P(x_2 = 1 | C_1) = \frac{2}{3}$ ("money" appears in 2 out of 3 spam emails).
    - $P(x_2 = 0 | C_1) = \frac{1}{3}$
- For $C_2$ (Not Spam):
    - $P(x_1 = 1 | C_2) = \frac{1}{3}$ ("free" appears in 1 out of 3 not-spam emails).
    - $P(x_2 = 1 | C_2) = \frac{1}{3}$ ("money" appears in 1 out of 3 not-spam emails).

**Step 3: Predict the Posterior probability $P(Y|X)$ - Classify a New Email**
- For $C_1$ (Spam):
$$P(C_1 | x) \propto P(x_1 = 1 | C_1) \cdot P(x_2 = 1 | C_1) \cdot P(C_1)$$
$$P(C_1 | x) \propto \frac{2}{3} \times \frac{2}{3} \times 0.5 $$
$$= \frac{4}{9} \times 0.5$$
$$= 0.2222$$

- For $C_2$ (Not Spam):
$$P(C_2 | x) \propto P(x_1 = 1 | C_2) \cdot P(x_2 = 1 | C_2) \cdot P(C_2)$$
$$P(C_2 | x) \propto \frac{1}{3} \times \frac{1}{3} \times 0.5 $$
$$P(C_2 | x) \propto \frac{1}{18} \approx 0.0556$$

Since $\frac{2}{9} > \frac{1}{18}$, the unnormalized score for spam is the larger one, so the email is classified as spam.
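
If actual posterior probabilities are wanted rather than just the predicted label, the two unnormalized scores can be divided by their sum. A minimal sketch using exact fractions, with variable names chosen here for illustration:

```python
from fractions import Fraction

# Unnormalized scores P(x | C_i) * P(C_i) from the calculation above
score_spam = Fraction(2, 3) * Fraction(2, 3) * Fraction(1, 2)       # 2/9
score_not_spam = Fraction(1, 3) * Fraction(1, 3) * Fraction(1, 2)   # 1/18

# Their sum is the evidence P(x), the denominator of Bayes' rule
evidence = score_spam + score_not_spam

posterior_spam = score_spam / evidence
posterior_not_spam = score_not_spam / evidence

print("P(Spam | x)     =", posterior_spam)
print("P(Not Spam | x) =", posterior_not_spam)
```

These values come from raw counts. scikit-learn's `BernoulliNB` applies additive smoothing by default, so its `predict_proba` output for the same email should not be expected to match them exactly.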

**Step 4: Generate New Samples**
- Sample from $P(Y)$ to generate a class label.
- Sample from $P(X_j ∣Y)$ to generate feature values for the sampled class.

In [None]:
from sklearn.naive_bayes import BernoulliNB
import numpy as np

# Dataset
X = np.array([
    [1, 1],  # Spam
    [1, 0],  # Spam
    [0, 1],  # Spam
    [0, 0],  # Not Spam
    [1, 0],  # Not Spam
    [0, 1]   # Not Spam
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1: Spam, 0: Not Spam

# Train Naive Bayes classifier
model = BernoulliNB()
model.fit(X, y)

# Predict probabilities for a new email
new_email = np.array([[1, 1]])  # "free" = 1, "money" = 1
predicted_prob = model.predict_proba(new_email)
print("Predicted Probabilities:", predicted_prob)

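# Generate new samples from the learned joint distribution P(X, Y)
n_samples = 5
class_priors = np.exp(model.class_log_prior_)    # P(Y)
feature_probs = np.exp(model.feature_log_prob_)  # P(x_j = 1 | Y), Laplace-smoothed
generated_samples = []
for _ in range(n_samples):
    y_gen = np.random.choice(model.classes_, p=class_priors)  # Sample class
    # Sample each binary feature from Bernoulli(P(x_j = 1 | y_gen))
    x_gen = [np.random.choice([0, 1], p=[1 - feature_probs[y_gen, j], feature_probs[y_gen, j]]) for j in range(X.shape[1])]
    generated_samples.append((x_gen, y_gen))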

print("\nGenerated Samples:")
for sample in generated_samples:
    print(f"Features: {sample[0]}, Class: {sample[1]}")

### Advantages and Disadvantages of Naive Bayes Classifier
Naive Bayes Classifier is a popular and widely used machine learning algorithm, especially for classification tasks. However, like any algorithm, it has its strengths and weaknesses. Below is a detailed list of the advantages and disadvantages of the Naive Bayes Classifier.

#### Advantages of Naive Bayes Classifier
Simplicity:
- Naive Bayes is easy to understand and implement.
- It requires minimal training data to estimate the parameters (prior probabilities and likelihoods).

Computational Efficiency:
- It is computationally efficient, both in terms of training and prediction.
- The algorithm scales well with large datasets and high-dimensional data.

Handles High-Dimensional Data:
- Naive Bayes performs well with datasets that have a large number of features (e.g., text data with thousands of words).

Works Well with Small Datasets:
- It can produce reliable results even with limited training data.

Robust to Irrelevant Features:
- Naive Bayes is relatively robust to irrelevant features because it assumes conditional independence between features.

Handles Missing Data:
- It can handle missing data by ignoring the missing values during probability estimation.

Probabilistic Output:
- It provides probabilistic predictions, which can be useful for decision-making (e.g., ranking predictions by confidence).

No Need for Feature Scaling:
- Naive Bayes does not require feature scaling (e.g., normalization or standardization) because it is based on probability distributions.

Works Well for Text Classification:
- It is particularly effective for text classification tasks like spam filtering, sentiment analysis, and document categorization.

Interpretable:
- The model's predictions are based on simple probability calculations, making it interpretable.

#### Disadvantages of Naive Bayes Classifier
Strong Independence Assumption:
- The assumption that features are conditionally independent given the class label is often unrealistic in real-world data.
- This can lead to suboptimal performance when features are correlated.

Sensitive to Feature Distributions:
- Naive Bayes assumes that features follow specific probability distributions (e.g., Gaussian for continuous features).
- If the data does not conform to these assumptions, the model's performance may suffer.

Zero Probability Problem:
- If a feature value does not appear in the training data for a particular class, the likelihood $P(X_j∣Y)$ becomes zero, causing the posterior probability to be zero.
- This can be mitigated using techniques like Laplace smoothing.
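
A minimal sketch of Laplace (add-one) smoothing, applied to the zero count from the medical example earlier in this notebook, where fatigue was present in 0 of the 3 "No Disease" rows. The function name and its parameters are illustrative assumptions, not a library API.

```python
def smoothed_likelihood(count, class_total, n_values=2, alpha=1.0):
    """Estimate P(x_j = v | class) with additive smoothing.

    count:       training rows in the class where feature j takes value v
    class_total: training rows in the class
    n_values:    number of distinct values feature j can take (2 for binary)
    alpha:       smoothing strength (alpha = 1 is Laplace smoothing)
    """
    return (count + alpha) / (class_total + alpha * n_values)


# Fatigue = 1 was observed in 0 of 3 "No Disease" rows
unsmoothed = 0 / 3
smoothed = smoothed_likelihood(count=0, class_total=3)

print("Unsmoothed:", unsmoothed)
print("Smoothed:  ", smoothed)
```

The smoothed estimate is small but nonzero, so a single unseen feature value no longer forces the whole product of likelihoods to zero.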

Limited Expressiveness:
- Naive Bayes is a simple model and may not capture complex relationships between features and the target variable.

Bias from Prior Probabilities:
- The model's predictions can be biased if the prior probabilities $P(Y)$ are not estimated accurately.

Not Suitable for Regression:
- Naive Bayes is designed for classification tasks and cannot be directly applied to regression problems.

Difficulty with Continuous Features:
- While Naive Bayes can handle continuous features using Gaussian distributions, it may not perform well if the data does not follow a normal distribution.

Overfitting with Small Datasets:
- Although Naive Bayes works well with small datasets, it can overfit if the dataset is too small or if there are too many features.

Dependence on Feature Engineering:
- The performance of Naive Bayes can be heavily influenced by the quality of feature engineering (e.g., handling missing values, encoding categorical variables).

Limited to Probabilistic Models:
- Naive Bayes is limited to probabilistic models and may not be suitable for tasks that require non-probabilistic approaches.

#### When to Use Naive Bayes
Naive Bayes is a good choice when:
- The dataset is small or medium-sized.
- The features are conditionally independent or nearly independent.
- The problem involves text classification or high-dimensional data.
- Interpretability and computational efficiency are important.

#### When Not to Use Naive Bayes
Avoid using Naive Bayes when:
- The features are highly correlated.
- The data does not conform to the assumed probability distributions.
- The problem requires capturing complex relationships between features and the target variable.
- The task involves regression or non-probabilistic predictions.

|Aspect	|Advantages	|Disadvantages |
|-------|-----------|--------------|
|Simplicity	|Easy to understand and implement.	|Limited expressiveness; may not capture complex relationships.|
|Efficiency	|Computationally efficient; scales well with large datasets.	|Sensitive to feature distributions; may not handle non-Gaussian data well.|
|High-Dimensional Data	|Performs well with high-dimensional data (e.g., text).	|Strong independence assumption; may not work well with correlated features.|
|Small Datasets	|Works well with limited training data.	|Overfitting risk with very small datasets.|
|Robustness	|Robust to irrelevant features; handles missing data.	|Zero probability problem; requires Laplace smoothing for unseen feature values.|
|Probabilistic Output	|Provides probabilistic predictions.	|Not suitable for regression tasks.|
|Interpretability	|Interpretable due to simple probability calculations.	|Bias from inaccurate prior probabilities.|

# K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is a non-parametric, instance-based learning algorithm used for both classification and regression tasks. It makes predictions based on the similarity between data points, meaning it relies on a majority vote (for classification) or an average of nearest neighbors (for regression).

Definition:
The nearest neighbors of a given data point are the K closest training samples determined by a distance metric (usually Euclidean distance).

How It Works in KNN:
- A new data point is classified based on the majority class of its nearest neighbors.
- The number of neighbors (K) is a hyperparameter that determines model behavior.

How KNN Algorithm Works:
- Choose K (the number of nearest neighbors).
- Compute the distance between the query point and all other points in the dataset (commonly using Euclidean distance).
- Find the K nearest neighbors based on the computed distances.
- Make a prediction:
    - For classification: Assign the class that is most frequent among the K neighbors (majority vote).
    - For regression: Predict the average (or weighted average) of the target values of K neighbors.

Significance in KNN:
- Too few neighbors (low K) → High variance (prone to noise).
- Too many neighbors (high K) → High bias (over-smooth decision boundaries).

### KNN algorithm
K-nearest neighbors (KNN) is a powerful yet easy-to-understand machine learning algorithm. In principle, it works by assigning the majority class of the K closest neighbors to the current data point. As such, no training is required beyond storing the data. All we do is choose K (i.e. the number of neighbors to consider) and a distance function to measure proximity, and we're good to go.

A typical KNN algorithm works as follows:

1. Choose K (number of neighbours)
2. Choose distance metric, e.g. Euclidean distance: 
$$
d(X^a,X^b) = \sqrt{(x^{a}_{1}-x^{b}_{1})^2 + (x^{a}_{2}-x^{b}_{2})^2 + \cdots + (x^{a}_{n}-x^{b}_{n})^2}
$$ 
3. For each data point $X_{test}$ in the testing data do:
    - For each data point $X_{train}$ in the training data do
        - Calculate the distance $d(X_{train},X_{test})$ between the test point and training observation.
    - Find labels of the K closest data points to $X_{test}$
    - Assign most frequent (i.e. the mode) class label to $X_{test}$

In the context of regression, we would use the mean of the K nearest neighbors instead of the mode of the class labels.
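
The regression variant can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `knn_regress` and the toy data are made up for this example.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict a numeric target as the mean of the k nearest training targets."""
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(np.sum((X_train - x_query) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]  # indices of the k closest points
    return y_train[nearest].mean()

# Toy 1-D data (made up for illustration): y is roughly 2 * x
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# The three nearest neighbours of 2.6 are x = 3, 2 and 4
prediction = knn_regress(X_train, y_train, np.array([2.6]), k=3)
print(prediction)  # mean of 6.2, 3.9 and 8.1
```

Replacing `.mean()` with a majority vote over the labels gives back the classification rule described above.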

### KNN classification using sklearn
Fit the model

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier  # kNN classifier; uses Euclidean distance by default
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, log_loss

import warnings
warnings.filterwarnings("ignore")

# Fit the model (X_train and y_train are assumed to come from an earlier train/test split)
n_neighbors = 3 # <--- change this number to play around with how many nearest neighbours to look for.
knn = KNeighborsClassifier(n_neighbors=n_neighbors)  # k-NN classifier with k = 3
knn.fit(X_train, y_train)

##### Testing the model: Assess model performance
Like before, let's have a look at the log loss

In [None]:
# Get predictions on the test set 
y_hat = knn.predict_proba(X_test)
# Calculate the loss 
print("The log loss error for our model is: ", log_loss(y_test, y_hat))


y_pred = knn.predict(X_test)

##### Evaluating the model

In [None]:
from sklearn.metrics import confusion_matrix  # sklearn's version infers the classes from the labels

conf_mat = confusion_matrix(y_test, y_pred)
conf_mat

# Accuracy
accuracy = np.trace(conf_mat)/np.sum(conf_mat)
print("Correctly identified: {} and total: {}".format(np.trace(conf_mat), np.sum(conf_mat)))
print("Accuracy with kNN and k = 3 : {:0.2f}%".format(accuracy*100))

##### Plot the decision boundary

We once again visualise the decision boundary of our KNN classifier. Remember to change `i` and `j` to compare different features to one another. 

In [None]:
i = 0; j = 2
knn.fit(X[:, [i, j]], y)
fig = plt.figure(figsize=(8, 8))
ax1 = fig.add_subplot(111)

x_min, x_max = X[:, i].min(), X[:, i].max()
y_min, y_max = X[:, j].min(), X[:, j].max()
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 1000), np.linspace(y_min, y_max, 1000))

y_hat = knn.predict(np.concatenate((xx.reshape(-1,1), yy.reshape(-1,1)), axis=1))
y_hat = y_hat.reshape(xx.shape)

# Pass the colormap by name; plt.cm.get_cmap has been removed from recent matplotlib releases
ax1.pcolormesh(xx, yy, y_hat, cmap='RdBu_r')
ax1.scatter(X[:, i], X[:, j], c=y, edgecolors='k', cmap='RdBu_r')
ax1.set_xlabel('Feature '+str(i))
ax1.set_ylabel('Feature '+str(j))
ax1.set_xlim(xx.min(), xx.max())
ax1.set_ylim(yy.min(), yy.max())
ax1.set_xticks(())
ax1.set_yticks(())
plt.show()

### K-Nearest Neighbour Classification from Scratch

##### Importing necessary libraries and loading data

In [None]:
import numpy as np
from sklearn.datasets import load_iris

X,y = load_iris(return_X_y=True)

##### Splitting the entire dataset into train and test

In [None]:
from sklearn.model_selection import train_test_split

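# Capitalised names are used so that the cells below (which refer to X_train and X_test) can find them
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2)
print(X_train.shape, X_test.shape)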

##### Standardizing the dataset

In [None]:
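from sklearn.preprocessing import StandardScaler # standardises each feature to zero mean and unit variance

SC = StandardScaler()
X_train = SC.fit_transform(X_train) # learn the mean and standard deviation from the training set
X_test = SC.transform(X_test)       # reuse the training statistics; refitting on the test set would leak information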
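
Because KNN relies on distances, a feature measured on a large scale can dominate the neighbour search. The following sketch illustrates this on made-up numbers (the two columns and their units are hypothetical): the raw and standardised distances to a single test point are printed side by side, and the scaler is fitted on the training data only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: column 0 has a small range, column 1 a large range
X_train = np.array([[1.0, 1000.0], [2.0, 1100.0], [3.0, 5000.0], [4.0, 5200.0]])
X_test = np.array([[1.5, 4900.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # mean and std learned from the training set only
X_test_std = scaler.transform(X_test)        # same statistics reused for the test point

# Euclidean distance from the test point to every training point
raw_dist = np.linalg.norm(X_train - X_test, axis=1)
std_dist = np.linalg.norm(X_train_std - X_test_std, axis=1)
print("raw distances:         ", np.round(raw_dist, 2))
print("standardised distances:", np.round(std_dist, 2))
```

In the raw distances the second column accounts for almost all of the distance; after standardisation both columns contribute on a comparable scale.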

##### Pairwise distances : How it works

In [None]:
from sklearn.metrics import pairwise_distances

pts_1 = np.asarray([[0, 1],[1, -1]])
pts_2 = np.asarray([[0, 0], [2, 0]])

pairwise_distances(pts_1, pts_2)
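
To see what `pairwise_distances` returns, the same matrix can be rebuilt by hand with NumPy broadcasting. This is a small sketch for checking the result, not a replacement for the library call; entry `D[i, j]` is the Euclidean distance between `pts_1[i]` and `pts_2[j]`.

```python
import numpy as np

pts_1 = np.asarray([[0, 1], [1, -1]])
pts_2 = np.asarray([[0, 0], [2, 0]])

# diff[i, j, :] = pts_1[i] - pts_2[j], obtained by broadcasting
diff = pts_1[:, np.newaxis, :] - pts_2[np.newaxis, :, :]
D = np.sqrt(np.sum(diff ** 2, axis=2))
print(D)
# [[1.         2.23606798]
#  [1.41421356 1.41421356]]
```

Row `i` of this matrix is exactly what the classifier sorts to find the neighbours of test point `i`.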

##### kNN Classification Algorithm

In [None]:
def kNN_classifier(X_train, y_train, n_class, X_test, k = 3): #by default the value is 3
    
    n_test_pt = X_test.shape[0]
    D = pairwise_distances(X_test, X_train)
    y_pred = np.empty(n_test_pt)
    
    for i in range(n_test_pt):
        neighbors = np.argsort(D[i,:])[:k] # indices of the k closest training points (nothing is skipped: test points are not in the training set)
        labels_neigh = y_train[neighbors]
        count = np.zeros(n_class)
        for j in labels_neigh:
            count[j] += 1
        y_pred[i] = np.argmax(count)
    
    return y_pred

# or: a pandas-based variant. It assumes `df_` is a DataFrame (not defined in this section)
# with "Temperature", "Humidity" and "Occupancy" columns.
from scipy import stats

def KNN(input_):
    col_index = 1  # column that holds the distance
    ls = []
    k = int(input("Please enter your k value: "))
    for i,row in df_.iterrows():
        # Euclidean distance between the row and the query point
        dist = np.sqrt((row["Temperature"]-input_[0])**2 + (row["Humidity"]-input_[1])**2)
        # Store the row index, its distance and its label
        ls.append([i, dist, row["Occupancy"]])
    # Convert the list into a NumPy array and sort it by distance
    np_ls = np.array(ls)
    sorted_np_ls = np_ls[np_ls[:,col_index].argsort()]
    top_k = sorted_np_ls[:k,:]
    mode_ = stats.mode(top_k[:,2])
    return top_k, mode_

top_k, mode_ = KNN([20,18])  # call once so that k is only asked for once
mode_
top_k

# or: the same steps split into small functions
def knn_array(df,input_):
    ls = []
    for i,row in df.iterrows():
        dist = np.sqrt((row["Temperature"]-input_[0])**2 + (row["Humidity"]-input_[1])**2)
        ls.append([i, dist, row["Occupancy"]])
    return np.array(ls)  # build the array once, after the loop

def sort_array(arr):
    col_index = 1
    return arr[arr[:,col_index].argsort()]

def top_k(arr):
    k = int(input("Please enter your k value: "))
    return arr[:k,:]

def mode(arr):
    return stats.mode(arr[:,2])

knn_arr = knn_array(df_,[20,18])
mode(top_k(sort_array(knn_arr)))

##### Prediction using KNN Classifier

In [None]:
y_pred = kNN_classifier(X_train, y_train, n_class = 3, X_test = X_test, k = 3) #K can change
y_pred.shape

##### Defining confusion matrix function and Evaluating the model

In [None]:
def confusion_matrix(y_true, y_pred, n_class):
    
    conf_mat = np.zeros((n_class, n_class),dtype='int')
    for i in range(y_true.shape[0]):
        conf_mat[int(y_true[i]), int(y_pred[i])] += 1
    
    return conf_mat


cm = confusion_matrix(y_test, y_pred, n_class=3)
cm

In [None]:
accuracy = np.trace(cm)/np.sum(cm)

print("Correctly identified: {} and total: {}".format(np.trace(cm), np.sum(cm)))
print("Accuracy with kNN and k = 3 : {:0.2f}%".format(accuracy*100))

### Example: Occupancy Detection

In this practical session, we will design and implement supervised learning method(s) for detecting occupancy of an office room. The dataset we will use is from [Luis M. Candanedo, Veronique Feldheim, "Accurate occupancy
detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models", Energy and Buildings, Volume 112, 15 January 2016, Pages 28-39](https://doi.org/10.1016/j.enbuild.2015.11.071)


##### Dataset

The dataset has a training set of 8143 examples and a test set of 2665 examples. Each example comprises features (acquired through sensors in an office room) and the corresponding target value (**Occupancy**). The features (**Temperature**, **Humidity**, **Light**, **CO2**, **Humidity_Ratio**) are recorded for each example and used to predict the state of the office room (**Occupancy**).

##### Loading data

The following code segment loads the training and testing data from the text files `../data/trainingdata.txt` and `../data/testingdata.txt`. Each row of the 2D NumPy arrays `training_data` and `testing_data` is one example from the training or testing dataset. The last column of each row is the target value: 0 for an **Unoccupied** room and 1 for an **Occupied** room.

In [None]:
import numpy as np
import matplotlib.pyplot as pl

# load training data
training_data = np.loadtxt('../data/trainingdata.txt', usecols=(2,3,4,5,6,7), skiprows=1, delimiter=',')
x_training = training_data[:, :-1]
y_training = training_data[:, -1]

# load testing data
testing_data = np.loadtxt('../data/testingdata.txt', usecols=(2,3,4,5,6,7), skiprows=1, delimiter=',')
x_testing = testing_data[:, :-1]
y_testing = testing_data[:, -1]

# feature names and their indexes on the 2D NumPy array
feature_index_names = {0:'Temperature', 1:'Humidity', 2:'Light', 3:'CO2', 4:'Humidity_Ratio'}

# statistics of training and testing datasets
print('training data:')
print('{:_<24s} = {:d}'.format('number of samples', y_training.shape[0]))
print('{:_<24s} = {:d}'.format('number of zeros', np.sum(y_training == 0)))
print('{:_<24s} = {:d}'.format('number of ones', np.sum(y_training == 1)))

print('testing data:')
print('{:_<24s} = {:d}'.format('number of samples', y_testing.shape[0]))
print('{:_<24s} = {:d}'.format('number of zeros', np.sum(y_testing == 0)))
print('{:_<24s} = {:d}'.format('number of ones', np.sum(y_testing == 1)))

### Feature selection
Some features may be irrelevant. Let's test KNN performance with different feature sets.

Key Insight:
- Feature selection can remove noisy or irrelevant features, which often improves accuracy.

##### Scatter plot

Scatter plots are a useful tool for identifying _discriminative_ features. The following function draws a scatter plot for every pair of features.

In [None]:
def scatter_plot(x, y, feature_index_names, save_figures=False):
    c = ['r', 'b']
    m = ['s', 'o']
    s = [32, 32]
    l = [r'$0$', r'$1$'] 
    font_size = 22
    x_class0 = x[y==0.0, :]
    x_class1 = x[y==1.0, :]    
    for i in np.arange(len(feature_index_names)-1):
        for j in np.arange(i+1, len(feature_index_names)):
            figure_name = feature_index_names[i] + ' vs ' + feature_index_names[j]
            pl.figure(figure_name)
            pl.scatter(x_class0[:, i], x_class0[:, j], c=c[0], marker=m[0], s=s[0], label=l[0])
            pl.scatter(x_class1[:, i], x_class1[:, j], c=c[1], marker=m[1], s=s[1], label=l[1])
            pl.xlabel(feature_index_names[i], fontsize=font_size)
            pl.ylabel(feature_index_names[j], fontsize=font_size)
            pl.legend(scatterpoints=1, fontsize=font_size, loc = 'upper right')
            pl.title(figure_name, fontsize=font_size)
            pl.tight_layout()
            if save_figures:
                pl.savefig(figure_name + '.png')

scatter_plot(x_training, y_training, feature_index_names)

In [None]:
from sklearn.feature_selection import SelectKBest, f_classif

# Select best features
selector = SelectKBest(score_func=f_classif, k=1)
X_new = selector.fit_transform(X_train, y_train)

# Train KNN with selected features
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_new, y_train)
print("Accuracy with best feature:", accuracy_score(y_test, knn.predict(selector.transform(X_test))))


### The sample correlation coefficient

Based on the scatter plots, it may be a good idea to choose **Light** and **CO2** as our discriminative features to predict the state of the office room. But can we quantify this? One option is the _sample correlation coefficient_ between the features and the target values. Let $x_{i}^{(n)}$ denote the $i$th feature of example $n$, and $y^{(n)}$ denote the corresponding target value. The _sample correlation coefficient_ between $x_{i}$ and $y$ is defined as follows
$$
r\left(x_{i}, y\right) = \frac{\sum\limits_{n=1}^{N} \left(x_{i}^{(n)} - \hat{x}_{i}\right) \left(y^{(n)} - \hat{y}\right)}{\sqrt{\sum\limits_{n=1}^{N} \left(x_{i}^{(n)} - \hat{x}_{i}\right)^{2} \sum\limits_{n=1}^{N} \left(y^{(n)} - \hat{y}\right)^{2}}},
$$
where $\hat{x}_{i} = \frac{1}{N}\sum\limits_{n=1}^{N}x_{i}^{(n)}$ is the sample mean of $x_{i}$, and analogously for $\hat{y}$. 

The following function `corr_coef` and code segment compute and display the sample correlation coefficients between features and target values.

In [None]:
def corr_coef(x, y):
    x_m = (x - np.mean(x))
    y_m = (y - np.mean(y))
    r = np.sum(x_m * y_m) / np.sqrt(np.sum(x_m**2) * np.sum(y_m**2))
    return r

y = y_training.copy()
for i in np.arange(len(feature_index_names)):
    x_i = x_training[:, i]
    r = corr_coef(x_i, y)    
    print('r(' + feature_index_names[i] + ', Occupancy) = ' + '{:.4f}'.format(r))
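
As a sanity check, the hand-written formula can be compared with NumPy's built-in `np.corrcoef`. The sketch below uses synthetic stand-in data (not the occupancy dataset) and repeats `corr_coef` so that it runs on its own.

```python
import numpy as np

def corr_coef(x, y):
    # Same formula as above: centred cross-product over the product of centred norms
    x_m = x - np.mean(x)
    y_m = y - np.mean(y)
    return np.sum(x_m * y_m) / np.sqrt(np.sum(x_m**2) * np.sum(y_m**2))

# Synthetic stand-in data: y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

r_manual = corr_coef(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print('manual = {:.6f}, numpy = {:.6f}'.format(r_manual, r_numpy))
```

The two values should agree up to floating-point rounding, since both implement the same definition.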

Observation:
- The sample correlation coefficients between the features and the target values show that **Light** and **CO2** are highly correlated with the state of the office room. Did you expect this? Can you also find out which features are highly correlated with each other? Use the following code segment to compute the sample correlation coefficients between features.

In [None]:
for i in np.arange(len(feature_index_names)-1):
    for j in np.arange(i+1, len(feature_index_names)):
        x_i = x_training[:, i]
        x_j = x_training[:, j]
        r = corr_coef(x_i, x_j)
        print('r(' + feature_index_names[i] + ', ' + feature_index_names[j] + ') = ' + '{:.4f}'.format(r))

The column indexes of **Light** and **CO2** in `x_training` and `x_testing` arrays are 2 and 3, respectively. The following code segment creates a copy of training and testing datasets comprised of only features **Light** and **CO2**.

In [None]:
x_training_selected = x_training[:,[2, 3]]
x_testing_selected = x_testing[:,[2, 3]]
print(x_training_selected.shape)
print(x_testing_selected.shape)

#### K-Nearest Neighbours: Finding and plotting K-Nearest Neighbours of the query point

Assume that we have a sample measurement (or query point, or test point) from the office room 
 $\mathbf{x}^{\left(q\right)} = \left[\begin{array}{c} x_{1}^{\left(q\right)} \\ x_{2}^{\left(q\right)}\end{array}\right] = \left[\begin{array}{l} \text{Light} \\ \text{CO}_{2}\end{array}\right] = \left[\begin{array}{l} 200.0000 \\ 1000.0000 \end{array}\right]$, and we want to predict whether the room is occupied or unoccupied. If the number of features is at most 3, we can use a 2D or 3D scatter plot to observe the proximity of the query point to the examples in our training dataset and predict the room status. The following function and code segment draw the scatter plot and show the K nearest neighbours of the query point from the training dataset.

In [None]:
def scatter_plot_two_features_only(x, y, feature1_name, feature2_name, x_q, K=3, save_figures=False):
    c = ['r', 'b', 'g']
    m = ['s', 'o', '*']
    s = [32, 32, 64]
    l = [r'$0$', r'$1$', r'query']
    font_size = 22
    x_class0 = x[y==0.0, :]
    x_class1 = x[y==1.0, :]    
    figure_name = feature1_name + ' vs ' + feature2_name + ' Query'
    pl.figure(figure_name)
    pl.scatter(x_class0[:, 0], x_class0[:, 1], c=c[0], marker=m[0], s=s[0], label=l[0])
    pl.scatter(x_class1[:, 0], x_class1[:, 1], c=c[1], marker=m[1], s=s[1], label=l[1])
    pl.scatter(x_q[0], x_q[1], c=c[2], marker=m[2], s=s[2], label=l[2])
    # find the nearest neighbours
    d = np.sqrt(np.sum((x - x_q)**2, axis=1)) # use the Euclidean distance
    i = np.argsort(d)
    for k in np.arange(K):
        temp = np.vstack((x_q, x[i[k],:]))
        pl.plot(temp[:,0], temp[:,1], c = c[2], linewidth=2)    
    pl.xlabel(feature1_name, fontsize=font_size)
    pl.ylabel(feature2_name, fontsize=font_size)
    pl.legend(scatterpoints=1, fontsize=font_size, loc = 'upper right')
    pl.title(figure_name, fontsize=font_size)
    pl.tight_layout()
    if save_figures:
        pl.savefig(figure_name + '.png')

In [None]:
K = 11
x_q = np.array([200.0, 1000.0])
x_q.shape
scatter_plot_two_features_only(x_training_selected, y_training, feature_index_names[2], feature_index_names[3], x_q, K=K)

What is the prediction for the query point $\mathbf{x}^{\left(q\right)} = \left[\begin{array}{c} x_{1}^{\left(q\right)} \\ x_{2}^{\left(q\right)}\end{array}\right] = \left[\begin{array}{l} \text{Light} \\ \text{CO}_{2}\end{array}\right] = \left[\begin{array}{l} 200.0000 \\ 1000.0000 \end{array}\right]$ when $K=111$? To make a prediction for $K=111$, you can use the following function and code segment.

In [None]:
def KNN(x, y, x_q, K=3):
    target_labels = np.unique(y) # unique set of target labels
    target_labels_counts = np.zeros(len(target_labels)) # keeps counts of target labels
    d = np.sqrt(np.sum((x - x_q)**2, axis=1)) # use the Euclidean distance
    i = np.argsort(d) # sort distance vector in ascending order
    for k in np.arange(len(target_labels)):
        target_labels_counts[k] = np.sum(y[i[0:K]]==target_labels[k]) # count the number of each target label in K Nearest Neighbourhood
    # apply the majority voting
    l = np.argmax(target_labels_counts)
    return target_labels[l]    

In [None]:
predicted_label = KNN(x_training_selected, y_training, x_q, K=111)
print('prediction = ' + '{:.0f}'.format(predicted_label))

#### Calculating the Training error for different values of K
The following code segment computes training error for different values of $K$.

In [None]:
N = y_training.shape[0] # the number of examples in training dataset
print('{:>2} {:>16}'.format('K', 'Training Error'))
for K in np.arange(1, 34, 2):
    y_prediction = np.zeros(N)
    for n in np.arange(N):
        x_q = x_training_selected[n]
        y_prediction[n] = KNN(x_training_selected, y_training, x_q, K)
    classification_error = np.sum(y_training != y_prediction) / N
    print('{:>2.0f} {:>16.4f}'.format(K, classification_error))

Can you compute the testing error of the classifier for $K = 111$?
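
One possible way to approach this is sketched below. The data here are hypothetical stand-in clusters, not the occupancy files, and the names `knn_predict` and `classification_error` are made up for this example. The only change from the training-error loop is that the neighbours are still searched in the training set while the query points and true labels come from the testing set.

```python
import numpy as np

def knn_predict(x, y, x_q, K=3):
    """Majority-vote KNN for one query point, searching the neighbours in (x, y)."""
    d = np.sqrt(np.sum((x - x_q) ** 2, axis=1))  # Euclidean distances
    nearest = np.argsort(d)[:K]
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def classification_error(x_train, y_train, x_eval, y_eval, K):
    """Fraction of evaluation points whose predicted label is wrong."""
    y_hat = np.array([knn_predict(x_train, y_train, x_q, K) for x_q in x_eval])
    return np.mean(y_hat != y_eval)

# Hypothetical stand-in data: two well-separated clusters
rng = np.random.default_rng(0)
x_tr = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
y_tr = np.repeat([0.0, 1.0], 100)
x_te = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
y_te = np.repeat([0.0, 1.0], 30)

test_error = classification_error(x_tr, y_tr, x_te, y_te, K=11)
print('testing error = {:.4f}'.format(test_error))
```

With the real data, the call would presumably take `x_training_selected`, `y_training`, `x_testing_selected` and `y_testing` with `K=111`.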

### Relationship Between KNN and Bayes Classifier
The Bayes classifier is the theoretically optimal classifier: it assigns a data point to the class with the highest posterior probability given the feature values:
$$ P(Y= c | X = x) = \frac{P(X = x | Y = c) P(Y = c)}{P(X = x)} $$

KNN can be seen as an approximation to the Bayes classifier.
- When K=1, KNN assigns a class based on the closest single observation, leading to high variance and possible misclassification.
- As K increases, KNN averages over more points, providing a smoother approximation to the Bayes classifier.
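
The link to the Bayes classifier can be illustrated on a problem where the true posterior is known. The sketch below assumes a made-up 1-D setting: two equally likely classes with class 0 drawn from $N(-1, 1)$ and class 1 from $N(+1, 1)$, for which Bayes' rule gives $P(Y=1 \mid X=x) = 1/(1+e^{-2x})$. It compares this posterior with the KNN estimate for a small and a larger K.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)                # equal class priors
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)  # class 0 ~ N(-1, 1), class 1 ~ N(+1, 1)
X = x.reshape(-1, 1)

# True posterior P(Y=1 | X=x) on a grid of query points
grid = np.linspace(-3, 3, 61).reshape(-1, 1)
true_posterior = 1.0 / (1.0 + np.exp(-2.0 * grid.ravel()))

errors = {}
for k in (1, 50):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    estimate = knn.predict_proba(grid)[:, 1]  # fraction of class-1 neighbours
    errors[k] = np.mean(np.abs(estimate - true_posterior))
    print(f"K = {k:>2}: mean absolute error vs. true posterior = {errors[k]:.3f}")
```

With K = 1 the estimate can only be 0 or 1, so it tracks the smooth posterior poorly; averaging over more neighbours brings it closer, as long as K stays small relative to the sample size.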

Impact of the Choice of 𝐾
- The choice of K significantly affects the performance of KNN:
    - Small K (e.g., 1 or 3)
        - Very sensitive to noise (high variance).
        - Can lead to overfitting (memorizes the dataset).
        - May not generalize well.
    - Large K (e.g., 20 or 50)
        - Reduces variance and smooths decision boundaries.
        - Can lead to underfitting if too large.
        - Predictions drift towards the majority class in imbalanced datasets, which increases bias.

KNN Bias-Variance Tradeoff
- Small 𝐾 → Low bias, high variance (fits the data too closely, not generalizable).
- Large 𝐾 → High bias, low variance (loses detailed structure, may misclassify complex patterns).
- Optimal 𝐾 balances bias and variance, often chosen using cross-validation.

### Conditional Probability in K-Nearest Neighbors (KNN)

In KNN classification, the probability that a point 𝑥 belongs to class 𝑐 is estimated using the proportion of class 𝑐 points among the 𝐾 nearest neighbors.

Mathematically, the conditional probability of a point 𝑥 belonging to class 𝑐 is given by:
$$ P(Y = c | X = x) = \frac{\text{Number of neighbors in class c}}{K} $$
- Where:
    - $P(Y = c | X = x)$ is the estimated probability that 𝑥 belongs to class 𝑐.
    - 𝐾 is the number of nearest neighbors.
    - The numerator is the number of neighbors within 𝐾 that belong to class 𝑐.

Example Calculation
- Suppose we have a dataset with two classes:
    - Class 0
    - Class 1
- We use K = 5 (5-nearest neighbors). If among the 5 neighbors:
    - 3 belong to Class 1
    - 2 belong to Class 0

Then the probability estimates are:
$$ P(Y = 1| X = x) = \frac{3}{5} = 0.6 $$
$$ P(Y = 0| X = x) = \frac{2}{5} = 0.4 $$

The classifier then assigns the class with the highest probability—in this case, Class 1.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Generate a synthetic dataset
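# n_informative + n_redundant must not exceed n_features, so the redundant features are switched off
X, y = make_classification(n_samples=200, n_features=2, n_informative=2, n_redundant=0, n_classes=2, random_state=42)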

# Split dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit KNN classifier with K=5
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Pick a sample point from the test set
sample_index = 0
sample_point = X_test[sample_index].reshape(1, -1)

# Predict class probabilities
probabilities = knn.predict_proba(sample_point)

print(f"Estimated Probability of Class 0: {probabilities[0][0]:.4f}")
print(f"Estimated Probability of Class 1: {probabilities[0][1]:.4f}")


Interpreting the Output
- The model calculates the proportion of K nearest neighbors in each class.
- The class with the highest probability is assigned to the new data point.
- If the probabilities are close (e.g., 0.51 vs. 0.49), the point lies near the decision boundary and the prediction is uncertain; increasing 𝐾 might help smooth the decision function.
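
To confirm that `predict_proba` is just the proportion of neighbours per class, the neighbours can be retrieved with `kneighbors` and counted by hand. This is a small self-contained sketch on synthetic data; the train/query split by slicing is an arbitrary choice made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)
X_train, y_train = X[:150], y[:150]
query = X[150:151]  # one held-out point, kept 2-D

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# kneighbors returns the distances and the training-set indices of the K nearest points
distances, indices = knn.kneighbors(query)
neighbor_labels = y_train[indices[0]]

# Proportion of neighbours in each class, in the order of knn.classes_
manual = np.array([np.mean(neighbor_labels == c) for c in knn.classes_])
print("manual proportions:", manual)
print("predict_proba:     ", knn.predict_proba(query)[0])
```

This holds for the default `weights='uniform'`; with distance weighting the proportions are replaced by normalised weights.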

## Calculate Probability of belonging to a class
- **Simple Probability**: Proportion of neighbors belonging to each class.
- **Weighted Probability**: Proportion of weights (inverse distances) for each class.
    - Use weighted probabilities when closer neighbors should have a stronger influence on the classification decision.

### Calculate Class Probabilities
In K-Nearest Neighbors (KNN) classification, you can calculate the probability of belonging to a class by examining the proportion of the nearest neighbors that belong to each class. This is particularly useful when you want to understand the confidence or uncertainty of the classification decision.

Steps to Calculate Class Probabilities
- Identify the k nearest neighbors of the test point.
- Count the number of neighbors that belong to each class.
- Divide the count for each class by k to get the probability.

Formula
- For a test point x, the probability $P(y = c|x)$ of belonging to class c is:
$$ P(y = c|x) = \frac{\text{Number of neighbors of class c}}{k}$$

### Example
Suppose k=5, and the classes of the nearest neighbors are:
- Neighbor 1: Class A
- Neighbor 2: Class A
- Neighbor 3: Class B
- Neighbor 4: Class A
- Neighbor 5: Class B

The probabilities are:
- $P(y = A|x) = \frac{3}{5} = 0.6$
- $P(y = B|x) = \frac{2}{5} = 0.4$

In [None]:
from collections import Counter

# Example: Nearest neighbors' classes
neighbor_classes = ['A', 'A', 'B', 'A', 'B']
k = len(neighbor_classes)  # Number of neighbors

# Count the occurrences of each class
class_counts = Counter(neighbor_classes)

# Calculate probabilities
class_probabilities = {cls: count / k for cls, count in class_counts.items()}

print("Class Probabilities:", class_probabilities)

### Calculate Weighted Class Probabilities
If you want to incorporate distances into the probability calculation (e.g., weighted KNN), you can use the inverse of the distances as weights. The probability of belonging to a class is then the sum of the weights for that class divided by the total sum of weights.

Formula:
$$ P(y = c|x) = \frac{\sum_{i \in \text{neighbors of class c}} \text{weight}_i }{\sum^k_{i = 1} \text{weight}_i}$$
- where:
    - $\text{weight}_i = \frac{1}{\text{distance}_i}$

Weighted Probability Example
- Suppose k=5, and the distances and classes of the nearest neighbors are:
    - Neighbor 1: Distance = 1.0, Class = A
    - Neighbor 2: Distance = 2.0, Class = A
    - Neighbor 3: Distance = 3.0, Class = B
    - Neighbor 4: Distance = 1.5, Class = A
    - Neighbor 5: Distance = 2.5, Class = B

The weights are:
- Class A: $\frac{1}{1.0} + \frac{1}{2.0} + \frac{1}{1.5} = 1 + 0.5 + 0.67 = 2.17$
- Class B: $\frac{1}{3.0} + \frac{1}{2.5} = 0.33 + 0.4 = 0.73$
    - The total weight is $2.17+0.73=2.90$

The probabilities are:
- $P(y = A|x) = \frac{2.17}{2.90} \approx 0.75$
- $P(y = B|x) = \frac{0.73}{2.90} \approx 0.25$

In [None]:
# Example: Distances and classes of nearest neighbors
distances = [1.0, 2.0, 3.0, 1.5, 2.5]
classes = ['A', 'A', 'B', 'A', 'B']

# Calculate weights
weights = [1 / dist for dist in distances]

# Sum weights for each class
class_weights = {'A': 0, 'B': 0}
for cls, weight in zip(classes, weights):
    class_weights[cls] += weight

# Total weight
total_weight = sum(class_weights.values())

# Calculate probabilities
class_probabilities = {cls: weight / total_weight for cls, weight in class_weights.items()}

print("Weighted Class Probabilities:", class_probabilities)
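
In scikit-learn, the same inverse-distance weighting is available through `weights='distance'`. The sketch below uses synthetic data and an arbitrary slicing split chosen for illustration. It recomputes the weighted probabilities by hand from `kneighbors` and compares them with `predict_proba`. It assumes the query point does not coincide with a training point, since a zero distance would need special handling.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, y_train, query = X[:150], y[:150], X[150:151]

knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_train, y_train)

distances, indices = knn.kneighbors(query)
w = 1.0 / distances[0]        # inverse-distance weights
labels = y_train[indices[0]]  # classes of the 5 nearest neighbours

# Sum of weights per class divided by the total weight
manual = np.array([w[labels == c].sum() for c in knn.classes_]) / w.sum()
print("manual weighted probabilities:", manual)
print("predict_proba:                ", knn.predict_proba(query)[0])
```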

### KNN and the Bias-Variance Trade-off
The choice of K (number of neighbors) in K-Nearest Neighbors (KNN) has a direct impact on the bias-variance trade-off.

##### What is Overfitting in KNN?
Overfitting occurs when a model learns the noise or specific details of the training data, resulting in poor generalization to unseen data. In KNN:
- A very small k (e.g., k=1) makes the model highly sensitive to noise and outliers in the training data, leading to overfitting.
- A very large k (e.g., k=N, where N is the total number of training samples) makes the model too simplistic, leading to underfitting.

|K Value|	Bias|	Variance|	Overfitting/Underfitting|
|-------|-------|-----------|---------------------------|
|Small 𝐾 (e.g., 1, 3, 5)|	Low|	High|	Overfitting (captures noise)|
|Medium 𝐾 (e.g., 10-20)|	Moderate|	Moderate|	Balanced trade-off|
|Large 𝐾 (e.g., 50, 100, total dataset size)|	High|	Low|	Underfitting (oversmooths decision boundaries)|
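
The pattern in the table can be checked numerically by comparing training and test accuracy for a small, a medium and a very large K. The sketch below uses a synthetic two-moons dataset; the particular K values are arbitrary choices for illustration, with the largest equal to the training-set size.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

results = {}
for k in (1, 15, len(X_train)):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # score() returns the mean accuracy on the given data
    results[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(f"K = {k:>3}: train accuracy = {results[k][0]:.3f}, test accuracy = {results[k][1]:.3f}")
```

With K = 1 every training point is its own nearest neighbour, so the training accuracy is perfect regardless of how well the model generalises. With K equal to the training-set size every query receives the same majority-class prediction.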

### How k Controls Overfitting
The value of k determines how many neighbors contribute to the prediction:
- Small k:
    - The model considers only a few nearest neighbors.
    - This makes the decision boundary more complex and flexible, capturing fine-grained patterns in the training data.
    - However, it also makes the model sensitive to noise and outliers, increasing the risk of overfitting.
- Large k:
    - The model considers many neighbors, smoothing out the decision boundary.
    - This reduces the model's sensitivity to noise and outliers, making it more robust.
    - However, it may oversimplify the decision boundary, leading to underfitting.

### Effect of 𝐾 on Bias and Variance
The choice of k directly affects the bias-variance trade-off:

Small 𝐾 (low bias, high variance)
- Low Bias: The model fits the training data closely.
- High Variance: The model is sensitive to small fluctuations in the training data.
- Decision boundary is very flexible.
- Fits training data well but is sensitive to noise.
- High variance because small changes in data can lead to drastic changes in predictions.

Large 𝐾 (high bias, low variance)
- High Bias: The model may oversimplify the data, leading to underfitting.
- Low Variance: The model is less sensitive to noise and outliers.
- Smoother decision boundary, making strong assumptions about the structure of the data.
- More stable predictions but less responsive to fine-grained patterns.
- High bias because it ignores local details.

By selecting an appropriate k, you can balance bias and variance to achieve good generalization.

Finding the Optimal 𝐾
- Cross-validation is commonly used to find an optimal 𝐾 that balances bias and variance.
    - Evaluate the model's performance (e.g., accuracy, F1-score) for different values of k using cross-validation.
- Typically, an elbow method (plotting accuracy vs. 𝐾) helps determine the best 𝐾.
    - Plot the error rate (e.g., misclassification rate) against k and choose the k where the error rate stabilizes or starts to increase.

Example: Impact of k on Decision Boundaries
Consider a binary classification problem with two classes. The decision boundaries for different values of k might look like this:
- k = 1: The decision boundary is highly irregular, capturing every detail of the training data (overfitting).
- k = 5: The decision boundary is smoother but still captures the general structure of the data.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Generate a synthetic dataset (moons for non-linearity)
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Function to plot decision boundaries
def plot_knn_decision_boundary(k_values, X_train, y_train):
    plt.figure(figsize=(15, 5))
    
    for i, k in enumerate(k_values):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        
        # Create grid
        x_min, x_max = X_train[:, 0].min() - 0.5, X_train[:, 0].max() + 0.5
        y_min, y_max = X_train[:, 1].min() - 0.5, X_train[:, 1].max() + 0.5
        xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                             np.linspace(y_min, y_max, 200))
        Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        # Plot
        plt.subplot(1, len(k_values), i + 1)
        plt.contourf(xx, yy, Z, alpha=0.3)
        plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k')
        plt.title(f'K = {k}')

    plt.show()

# Test different K values
plot_knn_decision_boundary([1, 5, 15, 50], X_train, y_train)

Expected Observations:
- K = 1 → High variance (overfitting)
    - The model perfectly memorizes the training data.
    - Decision boundary is irregular, capturing noise.
- K = 5 → Good balance
    - Some smoothness in the boundary, reducing overfitting.
- K = 15 → Reduced variance
    - Smoother decision boundaries.
    - Less sensitivity to small changes.
- K = 50 → High bias (underfitting)
    - Over-smooths the decision boundary.
    - Fails to capture real class structures.

### Finding the Optimal 𝐾 Using Cross-Validation
To select the best 𝐾, we use cross-validation to measure model performance across different values of 𝐾. The goal is to choose the 𝐾 that minimizes the cross-validated error, which serves as an estimate of the test error, while avoiding overfitting and underfitting.

Finding Optimal 𝐾
We will:
- Train KNN models with different values of 𝐾.
- Use cross-validation to estimate performance.
- Plot the error rate vs. 𝐾 to find the best value.

Expected Observations
- Small 𝐾 (e.g., 1-3) → High variance, unstable performance (overfitting).
- Large 𝐾 (e.g., 20-30) → High bias, smooth decision boundary (underfitting).
- Optimal 𝐾 is usually in the mid-range (e.g., 5-15).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Generate synthetic dataset
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Range of K values to test
k_values = range(1, 31)
cv_scores = []

# Perform cross-validation for each K
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5, scoring='accuracy')  # 5-fold cross-validation
    cv_scores.append(scores.mean())  # Store average accuracy

# Find optimal K
optimal_k = k_values[np.argmax(cv_scores)]
print(f"Optimal K: {optimal_k}")

# Plot accuracy vs K
plt.figure(figsize=(8, 5))
plt.plot(k_values, cv_scores, marker='o', linestyle='dashed', color='b')
plt.xlabel("K (Number of Neighbors)")
plt.ylabel("Cross-Validation Accuracy")
plt.title("Optimal K Selection using Cross-Validation")
plt.axvline(optimal_k, linestyle='--', color='r', label=f'Optimal K = {optimal_k}')
plt.legend()
plt.show()

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Test different values of k
k_values = range(1, 30)
cv_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

# Plot the results
plt.plot(k_values, cv_scores, marker='o')
plt.xlabel('k')
plt.ylabel('Cross-Validation Accuracy')
plt.title('k vs. Accuracy')
plt.show()

# Optimal k
optimal_k = k_values[cv_scores.index(max(cv_scores))]
print("Optimal k:", optimal_k)

### Effect of Increasing 𝐾 on the Decision Boundary in KNN Classifier
The decision boundary in K-Nearest Neighbors (KNN) depends on how the model classifies points based on their nearest neighbors. As the value of 𝐾 increases, the decision boundary undergoes significant changes:

1️⃣ Small 𝐾 (e.g., 𝐾=1 or 𝐾=3) → Complex, Highly Flexible Boundaries
- Each data point has a strong influence on classification.
- The decision boundary is highly irregular and follows the training data very closely.
- It captures small details and noise, leading to overfitting.
- Predictions are highly sensitive to minor changes in the data.

💡 Example:
- With 𝐾=1, each test point is classified based on only its nearest neighbor, leading to a jagged, irregular boundary.
- If the training data has noise, this noise gets incorporated into the decision boundary.

2️⃣ Moderate 𝐾 (e.g., 𝐾=5 to 𝐾=15) → Balanced Boundary
- The decision boundary is smoother and less sensitive to individual points.
- Classification is based on a group of neighbors, reducing sensitivity to outliers.
- It provides a good trade-off between capturing patterns and avoiding noise.

💡 Example:
- With 𝐾=10, the classifier considers a more generalized decision rule.
- The boundary is still flexible but avoids extreme fluctuations.

3️⃣ Large 𝐾 (e.g., 𝐾=50 or more) → Overly Smooth Boundaries
- The model considers many neighbors, making the decision boundary very smooth.
- It starts treating the data as one large region, losing finer distinctions.
- This leads to underfitting, where important details in the data structure are ignored.
- The classifier behaves more like a global majority vote, losing local patterns.
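
💡 Example:
- With a very large 𝐾 (e.g., 𝐾=100 on a small dataset), the vote is dominated by the overall class proportions. As 𝐾 approaches the number of training samples, the model assigns the most common class to all points, leading to an overly simplified decision boundary.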
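
The limiting case can be illustrated directly. The sketch below assumes the same `make_moons` data as the other cells and sets K equal to the number of training samples, so every query point sees the same set of neighbors; `knn_all` and `majority_class` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# With K equal to the training set size, every training point votes on every query
n_train = len(X_train)
knn_all = KNeighborsClassifier(n_neighbors=n_train).fit(X_train, y_train)
pred = knn_all.predict(X_test)

# The most frequent class in the training labels
majority_class = np.bincount(y_train).argmax()

print("Unique predicted classes:", np.unique(pred))
print("Majority class in training data:", majority_class)
```

With uniform weights, the prediction no longer depends on where the query point is, which is the extreme form of underfitting described above.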

##### Visualization of Decision Boundaries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Generate dataset
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Function to plot decision boundaries
def plot_knn_decision_boundary(k_values, X_train, y_train):
    plt.figure(figsize=(15, 5))
    
    for i, k in enumerate(k_values):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        
        # Create grid
        x_min, x_max = X_train[:, 0].min() - 0.5, X_train[:, 0].max() + 0.5
        y_min, y_max = X_train[:, 1].min() - 0.5, X_train[:, 1].max() + 0.5
        xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                             np.linspace(y_min, y_max, 200))
        Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        # Plot
        plt.subplot(1, len(k_values), i + 1)
        plt.contourf(xx, yy, Z, alpha=0.3)
        plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k')
        plt.title(f'K = {k}')

    plt.show()

# Test different K values
plot_knn_decision_boundary([1, 5, 15, 50], X_train, y_train)

KNN Classifier and Impact of K
- Let's train a KNN classifier and observe how the decision boundary changes with different values of K.

Key Takeaways:
- Low K (e.g., K=1) → Highly flexible, captures noise (high variance, low bias).
- Medium K (e.g., K=5) → Good balance between bias and variance.
- High K (e.g., K=15) → Smoother decision boundary, less sensitive to noise (high bias, low variance).

In [None]:
def plot_knn_decision_boundary(X, y, k=3):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    
    # Plot decision boundary
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='coolwarm')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', edgecolor='k')
    plt.title(f"KNN Decision Boundary (K={k})")
    plt.show()

# Visualize for different K values
for k in [1, 5, 15]:
    plot_knn_decision_boundary(X_train, y_train, k)


## Distance Measures | Metrics in KNN
Understanding distance metrics is crucial because KNN relies on calculating distances between data points to determine their similarity. This section explains the most commonly used distance metrics, their intuition, and their use cases.

##### **Euclidean Distance (Default in KNN)**
Definition: Euclidean distance is the straight-line distance between two points in Euclidean space. It is the most common distance metric and works well for continuous numerical data.

Significance in KNN:
- Measures the straight-line distance.
    - Used to measure similarity between data points.
- Produces smooth round decision boundaries (good for continuous, well-spaced data).
- Sensitive to differences in scale.
    - feature normalization is crucial
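
To see why scale matters, the sketch below uses a small made-up dataset (hypothetical ages and incomes, not taken from any real source) and compares Euclidean distances before and after standardizing with scikit-learn's `StandardScaler`. The helper `euclid` is defined here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: column 0 = age in years, column 1 = income in dollars
X = np.array([
    [25, 40_000],
    [30, 42_000],
    [55, 40_500],
    [60, 90_000],
], dtype=float)

def euclid(a, b):
    """Straight-line (L2) distance between two vectors."""
    return np.linalg.norm(a - b)

# Distances from row 0 to rows 1 and 2 on the raw features
raw_d = [euclid(X[0], X[i]) for i in (1, 2)]

# Standardize each column to zero mean and unit variance, then recompute
X_scaled = StandardScaler().fit_transform(X)
scaled_d = [euclid(X_scaled[0], X_scaled[i]) for i in (1, 2)]

print("Raw distances    (row0->row1, row0->row2):", raw_d)
print("Scaled distances (row0->row1, row0->row2):", scaled_d)
```

On the raw data the income column dominates, so the 55-year-old with a similar income looks closer than the 30-year-old. After scaling, both features contribute comparably and the ordering flips.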

Formula:
- For two points $P(x_1, y_1)$ and $Q(x_2, y_2)$, the Euclidean distance is:
$$ d(P,Q) = \sqrt{ (x_1 - x_2)^2 + (y_1 - y_2)^2} $$

- For two points in N-dimensional space:
    - $P = (p_1, p_2, ..., p_n)$
    - $Q = (q_1, q_2, ..., q_n)$
$$ d(P,Q) = \sqrt{\sum^n_{i =1} (p_i - q_i)^2} $$

Intuition:
- Euclidean distance measures the "as-the-crow-flies" distance between two points. It is sensitive to the magnitude of differences in each dimension.

Use Case:
- Works well for low-dimensional data.
- Suitable for data where all features are on the same scale.

In [None]:
import numpy as np

# Example: Distance between two points in a 2D space
P = np.array([1, 2])
Q = np.array([4, 6])

# Calculate Euclidean distance
euclidean_distance = np.linalg.norm(P - Q)
print("Euclidean Distance:", euclidean_distance)

##### **Manhattan Distance (L1 Norm | L1 distance)**
Definition: Manhattan distance (also known as "city block" distance) is the sum of the absolute differences between the coordinates of two points. The name comes from the path a taxi would take in a grid-like city such as Manhattan.
- Measures distance by summing absolute differences.
- Creates grid-like decision boundaries.
- Works well when features have different units or when movements are constrained (e.g., city block movement).

Formula:
- For two points $P=(p_1, p_2, …,p_n)$ and $Q=(q_1, q_2, …,q_n)$, the Manhattan distance is:
$$ d(P,Q) = \sum^n_{i = 1} |p_i - q_i| $$

Intuition:
- Manhattan distance measures the distance traveled along axes at right angles, like moving through a grid (e.g., city blocks).

Interpretation:
- Manhattan distance is less sensitive to outliers than Euclidean distance because it sums absolute differences instead of squaring them.
- It is suitable for high-dimensional data and categorical/binary features.

Use Case:
- Useful for high-dimensional data.
- Suitable for data with categorical or binary features.

Example:
- Let P=(1,2) and Q=(4,6). The Manhattan distance is:
$$d(P,Q)=∣1−4∣+∣2−6∣=3+4=7$$

In [None]:
# Calculate Manhattan distance
manhattan_distance = np.sum(np.abs(P - Q))
print("Manhattan Distance:", manhattan_distance)

In [None]:
import numpy as np

# Points
P = np.array([1, 2])
Q = np.array([4, 6])

# Manhattan distance
manhattan_distance = np.sum(np.abs(P - Q))
print("Manhattan Distance:", manhattan_distance)

##### **Minkowski Distance (Generalization of Euclidean and Manhattan)**
Definition: Minkowski distance is a generalization of Euclidean and Manhattan distances. It introduces a parameter p that allows it to behave like other distances.

Intuition:
- If 𝑝=1, it's Manhattan Distance.
- If 𝑝=2, it's Euclidean Distance.
- For other values of p, it can model different types of distances.

Formula:
$$ d(P,Q) = \left(\sum^n_{i = 1} |p_i - q_i|^p\right)^{\frac{1}{p}} $$

Use Case:
- Flexible for different types of data.
- Useful when you want to experiment with different distance metrics.

In [None]:
# Calculate Minkowski distance with p=3
p = 3
minkowski_distance = np.power(np.sum(np.power(np.abs(P - Q), p)), 1/p)
print("Minkowski Distance (p=3):", minkowski_distance)
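
The special cases listed under Intuition can be checked numerically. The sketch below uses SciPy's `minkowski` function on the same points P and Q; the large value `p=50` is an arbitrary choice made here to illustrate that the distance approaches the largest coordinate difference as p grows.

```python
import numpy as np
from scipy.spatial.distance import minkowski

P = np.array([1, 2])
Q = np.array([4, 6])

d1 = minkowski(P, Q, p=1)        # same as Manhattan distance: |1-4| + |2-6|
d2 = minkowski(P, Q, p=2)        # same as Euclidean distance: sqrt(3^2 + 4^2)
d_large = minkowski(P, Q, p=50)  # approaches max(|1-4|, |2-6|) for large p

print("p=1 :", d1)
print("p=2 :", d2)
print("p=50:", d_large)
```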

##### **Chebyshev Distance**
Definition: Chebyshev distance is the maximum absolute difference between the coordinates of two points.
- Measures only the largest difference in any dimension.
- Results in square-shaped decision boundaries.

Formula:
$$ d(P,Q) = \max_i |p_i - q_i| $$

Intuition:
- Chebyshev distance measures the greatest difference in any single dimension. On a chessboard it equals the minimum number of moves a king needs to travel between two squares, since a king moves one square at a time in any direction, including diagonally.

Use Case:
- Useful in games or grid-based systems.
- Suitable for data where the maximum difference is important.

In [None]:
# Calculate Chebyshev distance
chebyshev_distance = np.max(np.abs(P - Q))
print("Chebyshev Distance:", chebyshev_distance)

##### **Mahalanobis Distance (Accounts for Correlation Between Features)**
Definition: Mahalanobis distance measures the distance between a point and a distribution, taking into account the covariance between variables.
- Uses the covariance matrix 𝑆 to measure distances.
- Handles correlated features better.
- Requires estimating the covariance matrix, making it computationally expensive.

Formula:
$$ d(P,Q) = \sqrt{ (P - Q)^T S^{-1} (P - Q)} $$
- where S is the covariance matrix.

Intuition:
- Mahalanobis distance accounts for the scale and correlation of the data. It is useful when features are correlated or have different scales.

Use Case:
- Suitable for multivariate data with correlated features.
- Useful in outlier detection.

In [None]:
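import numpy as np
from scipy.spatial.distance import mahalanobis

# Example points
P = np.array([1, 2])
Q = np.array([4, 6])

# The covariance matrix must be estimated from a sample of observations.
# Two points alone give a singular (non-invertible) covariance matrix,
# so a small illustrative sample (made-up values) is used here.
data = np.array([[1, 2], [4, 6], [2, 5], [5, 3], [3, 4]])

# Calculate covariance matrix and its inverse
cov_matrix = np.cov(data, rowvar=False)
inv_cov_matrix = np.linalg.inv(cov_matrix)

# Calculate Mahalanobis distance
mahalanobis_distance = mahalanobis(P, Q, inv_cov_matrix)
print("Mahalanobis Distance:", mahalanobis_distance)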

##### **Cosine Similarity**
Definition: Cosine similarity measures the cosine of the angle between two vectors, so it reflects their orientation rather than their magnitude. It is often used to measure similarity in high-dimensional spaces. Cosine distance is derived from cosine similarity and is defined as:
$$ \text{Cosine Distance} = 1 - \text{Cosine Similarity} $$

Formula:
$$ \text{Cosine Similarity} = \frac{P \cdot Q}{||P|| ||Q||} $$
- where:
    - $P \cdot Q$ is the dot product of P and Q.
    - $\|P\|$ and $\|Q\|$ are the magnitudes (Euclidean norms) of P and Q.

The cosine distance is:
$$ \text{Cosine Distance} = 1 - \frac{P \cdot Q}{||P|| ||Q||} $$

Intuition:
- Cosine similarity focuses on the orientation (angle) of the vectors rather than their magnitude. It is useful for text data or sparse data.

Interpretation:
- Cosine distance is useful for high-dimensional data, such as text data (e.g., TF-IDF vectors).
- It focuses on the orientation of the vectors, making it insensitive to magnitude.
- A cosine distance of 0 means the vectors point in the same direction (angle = 0°), even if their lengths differ, while a distance of 1 means they are orthogonal (angle = 90°).

Use Case:
- Suitable for text data (e.g., TF-IDF vectors).
- Useful for high-dimensional sparse data.
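
The insensitivity to magnitude can be shown with a short sketch. It uses SciPy's `cosine` function (which returns the cosine distance, not the similarity) and `cityblock` for comparison; the vectors `P_scaled` and `R` are made up here for illustration.

```python
import numpy as np
from scipy.spatial.distance import cosine, cityblock

P = np.array([1.0, 2.0])
P_scaled = 10 * P            # same direction as P, ten times longer
R = np.array([-2.0, 1.0])    # orthogonal to P (dot product is 0)

cos_same_direction = cosine(P, P_scaled)   # close to 0: the angle is 0 degrees
manhattan_same_dir = cityblock(P, P_scaled)  # 9 + 18 = 27: magnitude matters
cos_orthogonal = cosine(P, R)              # 1: the angle is 90 degrees

print("Cosine distance, same direction:", cos_same_direction)
print("Manhattan distance, same direction:", manhattan_same_dir)
print("Cosine distance, orthogonal:", cos_orthogonal)
```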

Key Differences Between Manhattan and Cosine Distance

|Metric|	Interpretation|	Use Case|	Sensitivity to Magnitude|
|------|------------------|---------|---------------------------|
|Manhattan|	Sum of absolute differences.|	High-dimensional, categorical/binary data.|	Sensitive.|
|Cosine|	Angle between vectors.|	Text data, high-dimensional sparse data.|	Insensitive.|

##### When to Use Each Distance

Manhattan Distance:
- Use when the magnitude of differences in each dimension is important.
- Suitable for grid-like structures or when outliers should not dominate the distance.

Cosine Distance:
- Use when the orientation (angle) of the vectors is more important than their magnitude.
- Ideal for text data or high-dimensional sparse data.

Example:
- Let P=(1,2) and Q=(4,6). The cosine distance is calculated as follows:
    - Dot product: 
        - $P⋅Q=(1×4)+(2×6)=4+12=16$
    - Magnitudes:
        - ||P|| = $\sqrt{1^2 + 2^2} = \sqrt{1 + 4} = \sqrt{5}$
        - ||Q|| = $\sqrt{4^2 + 6^2} = \sqrt{16 + 36} = \sqrt{52}$
    - Cosine similarity:
        - $ \text{Cosine Similarity} = \frac{16}{\sqrt{5} \times \sqrt{52}} \approx \frac{16}{16.12} \approx 0.992 $
    - Cosine distance:
        - Cosine Distance= $1− 0.992 = 0.008$

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Calculate cosine similarity
cos_sim = cosine_similarity([P], [Q])
print("Cosine Similarity:", cos_sim[0][0])

In [None]:
from sklearn.metrics.pairwise import cosine_distances

# Points
P = np.array([1, 2]).reshape(1, -1)
Q = np.array([4, 6]).reshape(1, -1)

# Cosine distance
cosine_distance = cosine_distances(P, Q)
print("Cosine Distance:", cosine_distance[0][0])

##### **Hamming Distance**
Definition: Hamming distance measures the number of positions at which the corresponding symbols are different. It is used for categorical or binary data.

Formula:
$$ d(P,Q) = \sum^n_{i = 1} 1(p_i \neq q_i) $$
- where 1 is the indicator function.

Intuition:
- Hamming distance counts the number of mismatches between two vectors.

Use Case:
- Suitable for binary or categorical data.
- Useful in error detection and correction.

In [None]:
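import numpy as np
from scipy.spatial.distance import hamming

# Example binary data
P = np.array([1, 0, 1, 1])
Q = np.array([0, 0, 1, 0])

# SciPy's hamming() returns the fraction of positions that differ,
# while the formula above counts the mismatches
hamming_fraction = hamming(P, Q)
hamming_count = int(np.sum(P != Q))
print("Hamming Distance (fraction of mismatches):", hamming_fraction)
print("Hamming Distance (number of mismatches):", hamming_count)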

|Distance Metric|	Use Case |
|---------------|------------|
|Euclidean Distance|	Low-dimensional, continuous data.|
|Manhattan Distance|	High-dimensional, categorical/binary data.|
|Minkowski Distance|	Flexible, customizable distance metric.|
|Chebyshev Distance|	Grid-based systems, maximum difference matters.|
|Mahalanobis Distance|	Multivariate data with correlated features.|
|Cosine Similarity|	Text data, high-dimensional sparse data.|
|Hamming Distance|	Binary or categorical data.|

In [None]:
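import numpy as np
from scipy.spatial.distance import euclidean, cityblock, jaccard
from Levenshtein import distance as levenshtein_distance  # Install using `pip install python-Levenshtein`

point1 = np.array([2, 3])
point2 = np.array([5, 7])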

# Euclidean Distance
print("Euclidean Distance:", euclidean(point1, point2))

# Manhattan Distance
print("Manhattan Distance:", cityblock(point1, point2))

# Jaccard Distance (for binary feature vectors)
binary_vec1 = np.array([1, 0, 1, 0, 1])
binary_vec2 = np.array([1, 1, 0, 0, 1])
print("Jaccard Distance:", jaccard(binary_vec1, binary_vec2))

# Levenshtein Distance (for string similarity)
string1 = "knn"
string2 = "kmn"
print("Levenshtein Distance:", levenshtein_distance(string1, string2))

### Effect of Distance Metrics on the Decision Boundary in KNN
The K-Nearest Neighbors (KNN) classifier relies heavily on distance metrics to determine which training samples are the "nearest neighbors" of a given point. The choice of distance metric affects the shape and flexibility of the decision boundary.

#### Distance Metrics in KNN
KNN commonly uses the following distance measures:
##### **Euclidean Distance (Default in KNN)**
- Measures the straight-line distance.
- Produces round decision boundaries (good for continuous, well-spaced data).
- Sensitive to differences in scale.
$$ d(p,q) = \sqrt{\sum (p_i - q_i)^2} $$

##### **Manhattan Distance (L1 Norm)**
- Measures distance by summing absolute differences.
- Creates grid-like decision boundaries.
- Works well when features have different units or when movements are constrained (e.g., city block movement).
$$ d(p,q) = \sum |p_i - q_i| $$

##### **Minkowski Distance (Generalization of Euclidean and Manhattan)**
- If 𝑝=1, it's Manhattan Distance.
- If 𝑝=2, it's Euclidean Distance.
$$ d(p,q) = (\sum |p_i - q_i|^p)^{\frac{1}{p}} $$

##### **Chebyshev Distance**
- Measures only the largest difference in any dimension.
- Results in square-shaped decision boundaries.
$$ d(p,q) = \max |p_i - q_i| $$

##### **Mahalanobis Distance (Accounts for Correlation Between Features)**
- Uses the covariance matrix 𝑆 to measure distances.
- Handles correlated features better.
- Requires estimating the covariance matrix, making it computationally expensive.
$$ d(p,q) = \sqrt{ (p - q)^T S^{-1} (p - q)} $$

Observations and Comparisons
|Distance Metric|	Decision Boundary Shape|	Best for|
|---------------|--------------------------|------------|
|Euclidean Distance|	Round decision boundaries (smooth)|	Well-spaced continuous data|
|Manhattan Distance|	Grid-like decision boundaries|	City-block data, different units|
|Chebyshev Distance|	Boxy decision boundaries|	Grid-based movements (chess, warehouse)|
|Mahalanobis Distance|	Adaptive decision boundary|	Correlated features, real-world finance|


### Which Distance Metric Should You Use?
- Euclidean: Default choice; works well when all features have similar importance.
- Manhattan: Preferred when features have different scales or grid-like structures.
- Chebyshev: Useful when movement is constrained (e.g., chess, delivery routes).
- Mahalanobis: Best when features are correlated, especially in financial data.
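
These guidelines are rules of thumb, so it often helps to compare metrics empirically. The sketch below assumes the same `make_moons` data used elsewhere in these notes and a fixed K of 5, and compares three metrics with 5-fold cross-validation; `metric_scores` and `best_metric` are names introduced here for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.2, random_state=42)

metric_scores = {}
for metric in ["euclidean", "manhattan", "chebyshev"]:
    knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
    # Mean accuracy over 5 folds for this metric
    scores = cross_val_score(knn, X, y, cv=5, scoring="accuracy")
    metric_scores[metric] = scores.mean()
    print(f"{metric:<10} mean CV accuracy: {metric_scores[metric]:.3f}")

best_metric = max(metric_scores, key=metric_scores.get)
print("Best metric on this dataset:", best_metric)
```

On a low-dimensional dataset like this one the differences are usually small, so the result should not be generalized to other data.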

##### Decision Boundaries for Different Distance Metrics

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Generate synthetic dataset
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Function to plot decision boundaries for different distance metrics
def plot_decision_boundary(dist_metrics, X_train, y_train):
    plt.figure(figsize=(15, 5))

    for i, metric in enumerate(dist_metrics):
        knn = KNeighborsClassifier(n_neighbors=5, metric=metric)
        knn.fit(X_train, y_train)

        # Create a grid
        x_min, x_max = X_train[:, 0].min() - 0.5, X_train[:, 0].max() + 0.5
        y_min, y_max = X_train[:, 1].min() - 0.5, X_train[:, 1].max() + 0.5
        xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                             np.linspace(y_min, y_max, 200))
        Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)

        # Plot decision boundary
        plt.subplot(1, len(dist_metrics), i + 1)
        plt.contourf(xx, yy, Z, alpha=0.3)
        plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolors='k')
        plt.title(f'Metric: {metric}')

    plt.show()

# Test different distance metrics
plot_decision_boundary(['euclidean', 'manhattan', 'chebyshev'], X_train, y_train)
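
Mahalanobis distance is not included in the comparison above because it needs an extra parameter. The sketch below is one possible way to use it, assuming scikit-learn's `metric_params` argument with the inverse covariance matrix passed as `VI` and the brute-force neighbor search; the covariance is estimated from the training features only.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Inverse covariance matrix S^{-1}, estimated from the training features
VI = np.linalg.inv(np.cov(X_train, rowvar=False))

knn_maha = KNeighborsClassifier(
    n_neighbors=5,
    metric="mahalanobis",
    metric_params={"VI": VI},
    algorithm="brute",  # brute-force search; assumed here for simplicity
)
knn_maha.fit(X_train, y_train)

maha_acc = knn_maha.score(X_test, y_test)
print(f"Mahalanobis KNN test accuracy: {maha_acc:.3f}")
```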


### Combining Functions and How They Are Used for Classification
In K-Nearest Neighbors (KNN), a combining function aggregates the contributions of the k nearest neighbors into a final prediction. It determines how each neighbor's vote or value is weighted and combined to produce a single output. This step is crucial because KNN relies on the "wisdom of the crowd" (i.e., the nearest neighbors) to determine the class of a new data point. This section explains how combining functions work, the intuition behind them, and their role in KNN classification.

Definition:
A combining function determines how the votes of the nearest neighbors are aggregated to make a final decision.

What Are Combining Functions?
- Combining functions are mathematical or logical operations that aggregate multiple inputs into a single output. In the context of KNN, combining functions are used to:
    - Aggregate the class labels or values of the k nearest neighbors.
    - Assign weights to neighbors based on their distance or similarity to the test point.
    - Produce a final prediction (classification or regression).

Types:
- Majority voting (for Classification): The most frequent class among neighbors is chosen.
    - How It Works:
        - Each of the k nearest neighbors "votes" for its class label.
        - The class with the most votes is assigned to the test point.
    - Interpretation:
        - Assumes that the most frequent class among the neighbors is the best prediction.
        - Works well when the data is balanced and the classes are well-separated.
    - Example:
        - If k=5 and the neighbors' classes are [A, A, B, A, B], the predicted class is A.
- Weighted voting (for Classification): Closer neighbors are given more influence.
    - How It Works:
        - Each neighbor's vote is weighted by its distance or similarity to the test point.
        - Closer neighbors have higher weights.
        - The class with the highest total weight is assigned to the test point.
    - Interpretation:
        - Reflects the intuition that closer neighbors are more similar to the test point and should have a stronger influence on the prediction.
        - Useful when the data has varying densities or when some neighbors are significantly closer than others.
    - Example:
        - If k=5 and the weights for classes A and B are 2.17 and 0.73, respectively, the predicted class is A.
- Probability estimation: Predicts probability distribution instead of a strict class label.
- Averaging (for Regression)
    - How It Works:
        - The target values of the k nearest neighbors are averaged to produce the final prediction.
    - Interpretation:
        - Assumes that the target value of the test point is the average of its neighbors' values.
        - Works well when the relationship between features and the target is smooth.
    - Example:
        - If k=5 and the neighbors' target values are $[3, 5, 4, 6, 5]$, the predicted value is $\frac{3 + 5 + 4 + 6 + 5}{5} = 4.6$
- Weighted Averaging (for Regression)
    - How It Works:
        - The target values of the k nearest neighbors are weighted by their distance or similarity to the test point.
        - Closer neighbors have higher weights.
        - The final prediction is the weighted average of the neighbors' target values.
    - Interpretation:
        - Reflects the intuition that closer neighbors are more similar to the test point and should contribute more to the prediction.
        - Useful when the relationship between features and the target is non-linear or when some neighbors are significantly closer than others.
    - Example:
        - If k=5, the neighbors' target values are $[3, 5, 4, 6, 5]$, and the weights are $[0.5, 0.3, 0.2, 0.4, 0.3]$, the predicted value is: $\frac{(3 \times 0.5) + (5 \times 0.3) + (4 \times 0.2) + (6 \times 0.4) + (5 \times 0.3)}{0.5 + 0.3 + 0.2 + 0.4 + 0.3} = \frac{1.5 + 1.5 + 0.8 + 2.4 + 1.5}{1.7} = \frac{7.7}{1.7} \approx 4.53$
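
Probability estimation, listed above without an example, can be sketched with scikit-learn's `predict_proba`. The block assumes the `make_moons` data used elsewhere in these notes and uniform (majority-vote) weights, in which case each probability is the fraction of the 5 neighbors belonging to that class.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# One row per query point, one column per class; each row sums to 1
proba = knn.predict_proba(X_test[:3])
pred = knn.predict(X_test[:3])

print("Class order:", knn.classes_)
print("Estimated probabilities:\n", proba)
print("Predicted labels:", pred)
```

The predicted label is the class with the highest estimated probability, so a strict class label can always be recovered from the probability output.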

Interpretation of Combining Functions
- Combining functions determine how the contributions of individual components (e.g., neighbors) are aggregated to make a final decision. The choice of combining function depends on:
    - The nature of the problem: Classification vs. regression.
    - The data distribution: Balanced vs. imbalanced, linear vs. non-linear.
    - The importance of neighbors: Whether closer neighbors should have a stronger influence.

|Combining Function|	Use Case|	Interpretation|
|------------------|------------|-----------------|
|Majority Voting|	Classification|	The most frequent class among neighbors is the best prediction.|
|Weighted Voting|	Classification|	Closer neighbors have a stronger influence on the prediction.|
|Averaging|	Regression|	The target value is the average of the neighbors' values.|
|Weighted Averaging|	Regression|	Closer neighbors contribute more to the predicted value.|
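
The two regression rows of the table correspond to the `weights="uniform"` and `weights="distance"` options of scikit-learn's `KNeighborsRegressor`. The sketch below uses a made-up one-dimensional training set whose five nearest targets match the averaging example (3, 5, 4, 6, 5); the two far-away points are added here only to show that they are ignored.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical 1-D training data: five points near x = 0 and two far away
X_train = np.array([[-1.0], [-0.5], [0.2], [0.6], [1.0], [10.0], [11.0]])
y_train = np.array([3, 5, 4, 6, 5, 100, 100])
query = np.array([[0.0]])

# Plain averaging of the 5 nearest targets
uniform = KNeighborsRegressor(n_neighbors=5, weights="uniform").fit(X_train, y_train)
# Inverse-distance weighted averaging of the same 5 targets
weighted = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X_train, y_train)

pred_uniform = uniform.predict(query)[0]
pred_weighted = weighted.predict(query)[0]

print("Averaging:", pred_uniform)
print("Inverse-distance weighted averaging:", pred_weighted)
```

The weighted prediction is pulled toward the target of the closest neighbor (value 4 at distance 0.2), which is the behavior described in the table.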

Significance in KNN:
- Affects classification accuracy.
- Weighted voting can reduce misclassification when data points are unevenly distributed.

|Combining Function|	Use Case|
|------------------|------------|
|Majority Voting|	Simple, works well for balanced data.|
|Weighted Voting|	Accounts for varying distances; useful for imbalanced or dense data.|
|Kernel Smoothing|	Provides non-linear weighting; useful for complex data distributions.|
|Weighted Average|	Used in KNN regression to predict continuous values.|

In [None]:
# Majority Voting:
from collections import Counter

# Example: Nearest neighbors' classes
neighbor_classes = ['A', 'A', 'B', 'A', 'B']

# Majority voting
majority_class = Counter(neighbor_classes).most_common(1)[0][0]
print("Predicted Class (Majority Voting):", majority_class)

# Weighted Voting:
# Example: Distances and classes of nearest neighbors
distances = [1.0, 2.0, 3.0, 1.5, 2.5]
classes = ['A', 'A', 'B', 'A', 'B']

# Weighted voting
class_weights = {'A': 0, 'B': 0}
for dist, cls in zip(distances, classes):
    class_weights[cls] += 1 / dist

predicted_class = max(class_weights, key=class_weights.get)
print("Predicted Class (Weighted Voting):", predicted_class)

# Averaging:
# Example: Nearest neighbors' target values
neighbor_values = [3, 5, 4, 6, 5]

# Averaging
predicted_value = sum(neighbor_values) / len(neighbor_values)
print("Predicted Value (Averaging):", predicted_value)

# Weighted Averaging:
# Example: Nearest neighbors' target values and weights
neighbor_values = [3, 5, 4, 6, 5]
weights = [0.5, 0.3, 0.2, 0.4, 0.3]

# Weighted averaging
weighted_sum = sum(v * w for v, w in zip(neighbor_values, weights))
total_weight = sum(weights)
predicted_value = weighted_sum / total_weight
print("Predicted Value (Weighted Averaging):", predicted_value)

##### **Majority Voting (Default Combining Function)**
How It Works:
- For a given test point, KNN identifies the k nearest neighbors in the training data.
- Each neighbor "votes" for its class label.
- The class with the majority of votes is assigned to the test point.

Intuition:
- The assumption is that the most common class among the nearest neighbors is the best representation of the test point's class.
- This works well when the data is balanced and the classes are well-separated.

Example:
- Suppose k=5, and the classes of the nearest neighbors are: `[Class A, Class A, Class B, Class A, Class B]`.
    - Class A has 3 votes.
    - Class B has 2 votes.

The test point is classified as **Class A**.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from scipy.spatial.distance import euclidean, cityblock, jaccard
from Levenshtein import distance as levenshtein_distance  # Install using `pip install python-Levenshtein`

# Enable better visualization
sns.set_style("whitegrid")


from collections import Counter

# Example: Nearest neighbors' classes
neighbor_classes = ['A', 'A', 'B', 'A', 'B']

# Majority voting
majority_class = Counter(neighbor_classes).most_common(1)[0][0]
print("Predicted Class (Majority Voting):", majority_class)

##### **Weighted Voting (Scoring) or Similarity-Moderated Voting (Scoring)**
Weighted voting (or similarity-moderated voting) is a technique used in K-Nearest Neighbors (KNN) to give more importance to closer neighbors when making predictions. Instead of treating all neighbors equally, weighted voting assigns a weight to each neighbor based on its distance or similarity to the test point. Closer neighbors have higher weights, while farther neighbors have lower weights.

Key terms:
- Weighted Voting: Assigns higher importance to closer neighbors using weights.
- Inverse Distance Weighting: Simple and effective for most cases.
- Gaussian Kernel Weighting: Provides a smoother weighting scheme, useful for non-linear relationships.
- Class Probabilities: Calculated by normalizing the total weights for each class.

Intuition:
- Closer neighbors are more likely to be similar to the test point, so their votes should have a stronger influence on the prediction.
- This is useful when the data has varying densities or when some neighbors are significantly closer than others.

How It Works:
- Instead of treating all neighbors equally, closer neighbors are given more weight in the voting process.
- The weight is typically calculated as: 
    - the inverse of the distance: weight = $\frac{1}{\text{distance}}$.
    - or using a kernel function (e.g., Gaussian kernel).
- The class with the highest total weight is assigned to the test point.
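
The kernel option can be sketched as follows. This is a minimal illustration that assumes a Gaussian kernel of the form $\exp(-d^2 / (2\sigma^2))$ with a bandwidth `sigma = 1.0` chosen arbitrarily here; the distances and classes are the same five neighbors used in the inverse-distance example.

```python
import numpy as np

# Distances and classes of the five nearest neighbors
distances = np.array([1.0, 2.0, 3.0, 1.5, 2.5])
classes = np.array(["A", "A", "B", "A", "B"])

sigma = 1.0  # assumed bandwidth; smaller values concentrate weight on the closest neighbors

# Gaussian kernel weights: decay smoothly with the squared distance
weights = np.exp(-distances**2 / (2 * sigma**2))

# Sum the weights per class and pick the class with the largest total
class_weights = {str(c): float(weights[classes == c].sum()) for c in np.unique(classes)}
predicted_class = max(class_weights, key=class_weights.get)

print("Gaussian weights:", np.round(weights, 4))
print("Total weight per class:", class_weights)
print("Predicted class:", predicted_class)
```

Unlike inverse-distance weighting, the Gaussian kernel stays finite when a neighbor has distance 0, and its bandwidth controls how quickly influence falls off.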

Steps for Weighted Voting
- Identify the k nearest neighbors of the test point.
- Calculate the weight for each neighbor based on its distance or similarity to the test point.
- Sum the weights for each class.
- Assign the class with the highest total weight to the test point.
- Calculate class probabilities by dividing the total weight for each class by the sum of all weights.

Weight Calculation

The weight for a neighbor can be calculated in several ways:
- **Inverse Distance Weighting**:
$$ \text{weight} = \frac{1}{\text{distance}}$$

Example:
- Suppose k=5, and the distances and classes of the nearest neighbors are:
    - Neighbor 1: Distance = 1.0, Class = A
    - Neighbor 2: Distance = 2.0, Class = A
    - Neighbor 3: Distance = 3.0, Class = B 
    - Neighbor 4: Distance = 1.5, Class = A
    - Neighbor 5: Distance = 2.5, Class = B
    

Step 1: Calculate Weights
- Neighbor 1: $\frac{1}{1.0}$
- Neighbor 2: $\frac{1}{2.0}$
- Neighbor 3: $\frac{1}{3.0}$
- Neighbor 4: $\frac{1}{1.5}$
- Neighbor 5: $\frac{1}{2.5}$

Step 2: Sum Weights for Each Class
- The weights are:
    - Class A: $\frac{1}{1.0} + \frac{1}{2.0} + \frac{1}{1.5} = 1 + 0.5 + 0.67 = 2.17$
    - Class B: $\frac{1}{3.0} + \frac{1}{2.5} = 0.33 + 0.4 = 0.73$

Step 3: Assign the Class    
- The test point is classified as Class A.

Step 4: Calculate Class Probabilities
- Total weight: $2.17+0.73=2.90$

The probabilities are:
- $P(y = A|x) = \frac{2.17}{2.90} \approx 0.75$
- $P(y = B|x) = \frac{0.73}{2.90} \approx 0.25$

Weighted Voting in KNN
- Instead of using majority voting, weighted voting gives closer neighbors more influence.

Impact of Weighted Voting:
- Nearby points have more influence on the prediction.
- Helps when class distribution is imbalanced.

In [None]:
# Example: Distances and classes of nearest neighbors
distances = [1.0, 2.0, 3.0, 1.5, 2.5]
classes = ['A', 'A', 'B', 'A', 'B']

# Weighted voting
class_weights = {'A': 0, 'B': 0}
for dist, cls in zip(distances, classes):
    class_weights[cls] += 1 / dist

predicted_class = max(class_weights, key=class_weights.get)
print("Predicted Class (Weighted Voting):", predicted_class)

In [None]:
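from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Assumes X_train, X_test, y_train, y_test already exist (e.g., from train_test_split)
def weighted_knn(X_train, y_train, X_test, y_test, k=5):
    # weights='distance' weights each neighbor's vote by the inverse of its distance
    knn = KNeighborsClassifier(n_neighbors=k, weights='distance')
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    return accuracy_score(y_test, y_pred)

# Note: k must not exceed the number of training samples
for k in [1, 5, 15]:
    acc = weighted_knn(X_train, y_train, X_test, y_test, k)
    print(f"K={k}, Weighted KNN Accuracy: {acc:.2f}")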

##### Implement weighted voting and calculate class probabilities

In [None]:
# Example: Distances and classes of nearest neighbors
distances = [1.0, 2.0, 3.0, 1.5, 2.5]
classes = ['A', 'A', 'B', 'A', 'B']

# Calculate weights (inverse distance weighting)
weights = [1 / dist for dist in distances]

# Sum weights for each class
class_weights = {'A': 0, 'B': 0}
for cls, weight in zip(classes, weights):
    class_weights[cls] += weight

# Total weight
total_weight = sum(class_weights.values())

# Calculate class probabilities
class_probabilities = {cls: weight / total_weight for cls, weight in class_weights.items()}

# Print results
print("Weights per Class:", class_weights)
print("Class Probabilities:", class_probabilities)

##### **Kernel Smoothing**
How It Works:
- A kernel function (e.g., Gaussian, Epanechnikov) is used to assign weights to neighbors based on their distance.
- The kernel function gives higher weights to closer neighbors and lower weights to farther neighbors.
- The class with the highest total kernel-weighted score is assigned to the test point.

Intuition:
- Kernel smoothing provides a more nuanced way of weighting neighbors, especially when the relationship between distance and similarity is non-linear.

Example:
- Using a Gaussian kernel:

$$ \text{weight} = \exp (- \frac{\text{distance}^2}{2h^2})$$
- where
    - h: is the bandwidth parameter.

In [None]:
import numpy as np

# Gaussian kernel function
def gaussian_kernel(distance, h=1.0):
    return np.exp(-(distance**2) / (2 * h**2))

# Example: Distances and classes of nearest neighbors
distances = [1.0, 2.0, 3.0, 1.5, 2.5]
classes = ['A', 'A', 'B', 'A', 'B']

# Kernel-weighted voting
class_weights = {'A': 0, 'B': 0}
for dist, cls in zip(distances, classes):
    weight = gaussian_kernel(dist, h=1.0)
    class_weights[cls] += weight

predicted_class = max(class_weights, key=class_weights.get)
print("Predicted Class (Kernel Smoothing):", predicted_class)

##### Use a Gaussian kernel for weighting and calculating probabilities

In [None]:
import numpy as np

# Gaussian kernel function
def gaussian_kernel(distance, h=1.0):
    return np.exp(-(distance**2) / (2 * h**2))

# Calculate weights (Gaussian kernel weighting)
weights = [gaussian_kernel(dist, h=1.0) for dist in distances]

# Sum weights for each class
class_weights = {'A': 0, 'B': 0}
for cls, weight in zip(classes, weights):
    class_weights[cls] += weight

# Total weight
total_weight = sum(class_weights.values())

# Calculate class probabilities
class_probabilities = {cls: weight / total_weight for cls, weight in class_weights.items()}

# Print results
print("Weights per Class (Gaussian Kernel):", class_weights)
print("Class Probabilities (Gaussian Kernel):", class_probabilities)

### Combining Functions for Multi-Class Problems
In multi-class classification, the combining function must handle more than two classes. Both majority voting and weighted voting can be extended to multi-class scenarios by:
- Counting votes or weights for each class.
- Assigning the class with the highest total.
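
As a minimal sketch of the multi-class case, the same inverse-distance vote works unchanged when a third class appears. The neighbor distances and labels below are made up for illustration, and `weighted_vote` is a hypothetical helper name, not a library function.

```python
from collections import defaultdict

def weighted_vote(distances, labels):
    """Return (predicted_label, class_probabilities) using inverse-distance weights."""
    totals = defaultdict(float)
    for dist, label in zip(distances, labels):
        totals[label] += 1.0 / dist  # assumes dist > 0
    total_weight = sum(totals.values())
    probabilities = {label: w / total_weight for label, w in totals.items()}
    predicted = max(totals, key=totals.get)
    return predicted, probabilities

# Illustrative neighbors for a three-class problem (values are assumptions)
distances = [0.5, 1.0, 2.0, 2.0, 4.0]
labels = ['C', 'A', 'A', 'B', 'B']

predicted, probabilities = weighted_vote(distances, labels)
print("Predicted class:", predicted)
print("Class probabilities:", probabilities)
```

In this example a single very close neighbor of class C outweighs the two more distant neighbors each of classes A and B, which an unweighted majority vote would not allow.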

### Combining Functions for Regression
In KNN regression, instead of voting, the combining function typically computes a weighted average of the target values of the nearest neighbors. For example:
$$ \text{Prediction} = \frac{\sum^k_{i =1} \text{weight}_i \cdot y_i}{\sum^k_{i =1} \text{weight}_i}$$
- where 
    - $y_i$: is the target value of the i-th neighbor.
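
The weighted average above can be sketched in a few lines. This assumes inverse-distance weights (one of several possible choices) and uses made-up neighbor values; `weighted_knn_regression` is a hypothetical helper name.

```python
import numpy as np

def weighted_knn_regression(distances, targets):
    """Inverse-distance weighted average of neighbor targets (assumes distances > 0)."""
    distances = np.asarray(distances, dtype=float)
    targets = np.asarray(targets, dtype=float)
    weights = 1.0 / distances
    return np.sum(weights * targets) / np.sum(weights)

# Illustrative neighbors (values are assumptions, not from a real dataset)
distances = [1.0, 2.0, 4.0]
targets = [10.0, 20.0, 40.0]

prediction = weighted_knn_regression(distances, targets)
print("Weighted prediction:", prediction)
print("Unweighted mean:", np.mean(targets))
```

Because the closest neighbor has the smallest target value here, the weighted prediction is pulled below the plain mean of the three targets.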

### Issues with KNN
Nearest-neighbor methods, such as K-Nearest Neighbors (KNN), are simple and intuitive but come with several challenges. Below, I’ll describe the issues related to intelligibility, dimensionality, domain knowledge, and computational efficiency, and then dive deeper into two aspects of intelligibility.

##### **Intelligibility (Interpretability)**
Intelligibility refers to how easily humans can understand and interpret the model's predictions. While KNN is conceptually simple, it has some limitations in terms of intelligibility:

**Aspects of Intelligibility**:

Lack of Explicit Model:
- KNN is an instance-based (lazy) learning algorithm, meaning it doesn’t build an explicit model during training. Instead, it memorizes the training data and makes predictions based on similarity at inference time.
- This makes it difficult to explain why a particular prediction was made, especially in high-dimensional spaces or when k is large.

Local vs. Global Understanding:
- KNN provides a local explanation for predictions (e.g., "this point is classified as Class A because its nearest neighbors are mostly Class A").
- However, it doesn’t provide a global understanding of the decision boundaries or feature importance, which can be critical for domain experts.

Example:
- If KNN classifies a patient as having a disease, it’s hard to explain which features (e.g., age, blood pressure) contributed most to the decision, unlike in models like decision trees or linear regression.
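
To illustrate what a local explanation looks like in practice, the sketch below retrieves the neighbors behind one prediction using scikit-learn's `kneighbors` method. The tiny dataset and the patient values are assumptions made up for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny illustrative dataset: columns are [age, blood_pressure] (values are assumptions)
X_train = np.array([[25, 110], [30, 115], [45, 140], [50, 150], [55, 145], [28, 120]])
y_train = np.array([0, 0, 1, 1, 1, 0])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

patient = np.array([[48, 142]])
prediction = knn.predict(patient)[0]

# kneighbors returns the distances to, and training-set indices of, the k nearest neighbors
neighbor_distances, neighbor_indices = knn.kneighbors(patient)

print("Prediction:", prediction)
for dist, idx in zip(neighbor_distances[0], neighbor_indices[0]):
    print(f"  neighbor {idx}: features={X_train[idx]}, label={y_train[idx]}, distance={dist:.2f}")
```

The output shows which training points drove the prediction, but it does not say which feature mattered most, which is the limitation described above.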

**Two Aspects of Intelligibility in Detail**

Let’s dive deeper into two key aspects of intelligibility:

Lack of Explicit Model
- Why It’s a Problem:
    - Without an explicit model, it’s hard to summarize the relationship between features and the target variable.
    - Stakeholders (e.g., doctors, business leaders) often need clear explanations for predictions, which KNN cannot provide.

Mitigation:
- Use model-agnostic interpretability techniques like LIME or SHAP to explain individual predictions.
- Combine KNN with simpler models (e.g., decision trees) for better interpretability.

Local vs. Global Understanding
- Why It’s a Problem:
    - KNN provides local explanations (e.g., "this point is classified as Class A because its neighbors are Class A"), but it doesn’t explain global patterns or feature importance.
    - This limits its usefulness in scenarios where understanding the overall behavior of the model is critical.

Mitigation:
- Use dimensionality reduction techniques (e.g., PCA, t-SNE) to visualize decision boundaries in lower dimensions.
- Combine KNN with feature importance methods to identify globally important features.

##### **Dimensionality (Curse of Dimensionality)**

As the number of features (dimensions) increases, the performance of KNN often degrades due to the curse of dimensionality:
- In high-dimensional spaces, distances between points become less meaningful because most points are equally distant from each other.
- This makes it difficult for KNN to identify meaningful neighbors, leading to poor predictions.

Example:
- In a dataset with 100 features, two points may appear close in some dimensions but far in others, making it hard to define a meaningful distance metric.
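
The loss of contrast between near and far points can be observed in a small simulation. This is a rough sketch on uniformly random data (an assumption; real datasets usually have more structure), and `relative_contrast` is a hypothetical helper that measures how much farther the farthest point is than the nearest one.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_points, n_dims):
    """(max - min) / min of distances from one query point to random uniform points."""
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return (distances.max() - distances.min()) / distances.min()

contrasts = {d: relative_contrast(500, d) for d in [2, 10, 100, 1000]}
for d, c in contrasts.items():
    print(f"dims={d:>4}: relative contrast = {c:.2f}")
```

As the number of dimensions grows, the relative contrast shrinks, meaning the nearest and farthest points sit at almost the same distance from the query.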

##### **Domain Knowledge**

KNN doesn’t inherently incorporate domain knowledge:
- It treats all features equally unless a custom distance metric is used.
- Domain-specific relationships or constraints (e.g., certain features being more important) must be manually encoded into the distance metric or feature weights.

Example:
- In a medical diagnosis problem, domain knowledge might suggest that blood pressure is more important than age, but KNN won’t prioritize this unless explicitly told to do so.

**How the `curse of dimensionality` could be mitigated using `domain knowledge`**

The curse of dimensionality is a significant challenge in machine learning, particularly for algorithms like K-Nearest Neighbors (KNN) that rely on distance metrics. As the number of features (dimensions) increases, the data becomes sparse, and distances between points lose meaning, making it difficult for KNN to identify meaningful neighbors. However, domain knowledge can be leveraged to mitigate this issue.

What is the Curse of Dimensionality?
- In high-dimensional spaces:
    - Data points become increasingly sparse, making it hard to define meaningful neighborhoods.
    - Distances between points tend to converge, reducing the discriminative power of distance metrics.
    - The volume of the space grows exponentially, requiring exponentially more data to maintain the same density.

How Domain Knowledge Can Help
- Domain knowledge refers to expertise or understanding of the specific field or problem you’re working on. It can be used to:
    - Select relevant features and discard irrelevant or redundant ones.
    - Weight features based on their importance.
    - Engineer new features that capture meaningful relationships in the data.
    - Define custom distance metrics that align with domain-specific relationships.
- By incorporating domain knowledge, you can reduce the effective dimensionality of the data, making it more manageable and meaningful for KNN.

`Feature Selection`
- Domain knowledge can help identify which features are most relevant to the problem, allowing you to discard irrelevant or noisy features.

Example:
- In a medical diagnosis problem, domain experts might know that features like blood pressure, cholesterol levels, and age are critical, while features like patient ID or zip code are irrelevant.
    - By selecting only the relevant features, you reduce the dimensionality and improve the performance of KNN.

In [None]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Example dataset
data = {
    'age': [25, 45, 35, 50, 23],
    'blood_pressure': [120, 140, 130, 150, 110],
    'cholesterol': [200, 240, 220, 260, 180],
    'patient_id': [1, 2, 3, 4, 5],
    'zip_code': [12345, 23456, 34567, 45678, 56789],
    'target': [0, 1, 1, 1, 0]
}
df = pd.DataFrame(data)

# Use domain knowledge to select relevant features
X = df[['age', 'blood_pressure', 'cholesterol']]
y = df['target']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train KNN
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Evaluate
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

`Feature Weighting`
- Domain knowledge can be used to assign weights to features based on their importance. This ensures that more important features have a greater influence on the distance calculation.

Example:
- In a credit scoring problem, domain experts might know that income and credit history are more important than the number of dependents.
- You can assign higher weights to income and credit history when computing distances.

In [None]:
import numpy as np

# Define custom weights based on domain knowledge
weights = np.array([0.5, 0.3, 0.2])  # age, blood_pressure, cholesterol

# Custom distance function
def weighted_euclidean_distance(x1, x2, weights):
    return np.sqrt(np.sum(weights * (x1 - x2)**2))

# Example points
x1 = np.array([25, 120, 200])
x2 = np.array([45, 140, 240])

# Calculate weighted distance
distance = weighted_euclidean_distance(x1, x2, weights)
print("Weighted Euclidean Distance:", distance)
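
One possible way to use such weights inside scikit-learn's KNN, sketched here under the assumption that plain Euclidean distance is the metric, is to multiply each feature column by the square root of its weight before fitting. The data reuses the small medical example from earlier; the new query point is made up.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Same illustrative weights as above: age, blood_pressure, cholesterol
weights = np.array([0.5, 0.3, 0.2])

X = np.array([[25, 120, 200], [45, 140, 240], [35, 130, 220],
              [50, 150, 260], [23, 110, 180]], dtype=float)
y = np.array([0, 1, 1, 1, 0])

# Scaling each column by sqrt(weight) makes plain Euclidean distance
# equal to the weighted Euclidean distance on the original features.
X_scaled = X * np.sqrt(weights)

weighted = np.sqrt(np.sum(weights * (X[0] - X[1]) ** 2))
plain_on_scaled = np.linalg.norm(X_scaled[0] - X_scaled[1])
print("Weighted distance:", weighted)
print("Plain distance on scaled features:", plain_on_scaled)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_scaled, y)
new_point = np.array([[40.0, 135.0, 230.0]]) * np.sqrt(weights)  # hypothetical patient
print("Prediction:", knn.predict(new_point)[0])
```

The same scaling must be applied to every point passed to `predict`, otherwise training and query points live in different spaces.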

`Feature Engineering`
- Domain knowledge can guide the creation of new features that capture meaningful relationships in the data, reducing the need for high-dimensional raw features.

Example:
- In a retail problem, instead of using raw transaction data, domain experts might create features like "average purchase value" or "frequency of purchases."
- These engineered features are often more informative and reduce dimensionality.

In [None]:
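# Example: Feature engineering
# Assumes df is a retail DataFrame with the columns total_spent,
# number_of_purchases, and days_since_first_purchase (not the medical df above)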
df['average_purchase_value'] = df['total_spent'] / df['number_of_purchases']
df['purchase_frequency'] = df['number_of_purchases'] / df['days_since_first_purchase']

# Use engineered features
X = df[['average_purchase_value', 'purchase_frequency']]

`Custom Distance Metrics`
- Domain knowledge can be used to define custom distance metrics that align with domain-specific relationships.

Example:
- In a text classification problem, domain experts might know that certain words are more important than others. A custom distance metric can be designed to reflect this.

In [None]:
from sklearn.metrics.pairwise import pairwise_distances

# Custom distance metric
def custom_distance(x1, x2):
    # Domain-specific logic
    return np.sum(np.abs(x1 - x2))  # Example: Manhattan distance

# Compute pairwise distances
distances = pairwise_distances(X_train, X_test, metric=custom_distance)

Summary of Fixes Using Domain Knowledge

|Technique|	Description|	Example|
|---------|------------|-----------|
|Feature Selection|	Select only relevant features based on domain knowledge.|	Use age, blood pressure, and cholesterol for medical diagnosis.|
|Feature Weighting|	Assign higher weights to more important features.|	Weight income more heavily in credit scoring.|
|Feature Engineering|	Create new features that capture domain-specific relationships.|	Compute average purchase value in retail.|
|Custom Distance Metrics|	Define distance metrics that align with domain-specific logic.|	Use a custom text similarity metric for text classification.|

##### **Computational Efficiency**

KNN can be computationally expensive, especially for large datasets:
- Training Time: KNN is a lazy learner, so it doesn’t require training. However, it stores the entire training dataset, which can be memory-intensive.
- Inference Time: For each test point, KNN must compute distances to all training points, which scales as O(n⋅d), where n is the number of training samples and d is the number of features.
- This makes KNN impractical for large datasets or real-time applications.

Example:
- For a dataset with 1 million samples and 100 features, KNN must compute 1 million distances for each test point, each over 100 features, which is roughly 100 million feature-level operations per query.
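
As a rough sketch of one common mitigation, scikit-learn can build a KD-tree index so that a query does not have to visit every training point. The data here is synthetic and low-dimensional on purpose, since tree-based indexes help most when the number of features is small.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_train = rng.random((2000, 3))   # synthetic data: 2000 points, 3 features
X_query = rng.random((5, 3))

# Brute force: distances to every training point for each query
brute = NearestNeighbors(n_neighbors=5, algorithm='brute').fit(X_train)
# KD-tree: a spatial index that prunes regions which cannot contain a nearest neighbor
tree = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(X_train)

brute_dist, brute_idx = brute.kneighbors(X_query)
tree_dist, tree_idx = tree.kneighbors(X_query)

print("Same neighbors found:", np.array_equal(brute_idx, tree_idx))
```

Both approaches return exact neighbors; the tree only changes how much work each query takes.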

|Issue|	Description|	Mitigation|
|-----|------------|--------------|
|Intelligibility|	Hard to explain predictions; lacks global understanding.|	Use LIME/SHAP, combine with simpler models, or reduce dimensionality.|
|Dimensionality|	Curse of dimensionality reduces meaningfulness of distances.|	Use dimensionality reduction or feature selection.|
|Domain Knowledge|	Doesn’t incorporate domain-specific knowledge.|	Use custom distance metrics or feature weighting.|
|Computational Efficiency|	Computationally expensive for large datasets.|	Use spatial indexes (e.g., KD-Trees, Ball Trees) or approximate nearest neighbor search.|

# Clustering Models (e.g., K-Means)
What It Means: 
- Clustering groups similar data points together without predefined labels, often used for segmenting customers or finding patterns.

Outcome Interpretation: 
- Each cluster represents a natural grouping in the data, with data points in the same cluster sharing similar characteristics.

Performance Measures:
- Silhouette Score: Measures how well each point fits within its cluster; values closer to 1 indicate better-defined clusters.
- Within-Cluster Sum of Squares (WCSS): Measures the compactness of clusters; lower values are better.
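
Both measures can be obtained from scikit-learn, as in the sketch below. The data is synthetic (generated with `make_blobs`), so the printed values only illustrate the mechanics.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

wcss = kmeans.inertia_                      # within-cluster sum of squares
silhouette = silhouette_score(X, labels)    # mean silhouette over all points

print(f"WCSS: {wcss:.2f}")
print(f"Silhouette score: {silhouette:.3f}")
```

WCSS always falls as more clusters are added, so it is most useful for comparing runs with the same K or for spotting an elbow; the silhouette score can be compared across different values of K.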

Lay Explanation: 
- Clustering is like sorting items into bins based on similarity, helping us identify groups in our data.

Use Case: 
- To group similar observations without predefined labels.

Model Types: 
- K-Means, 
- Hierarchical Clustering, 
- DBSCAN.

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
clusters = kmeans.fit_predict(X)

###  Primary Idea Behind Clustering
Clustering is an unsupervised machine learning technique used to group similar data points together based on some notion of similarity (like distance). The main idea is:
- “Group similar items together such that items in the same group (cluster) are more similar to each other than to those in other groups.”

This is useful when you don’t have labeled data but want to find patterns or natural groupings in your dataset.

### Primary Idea Behind Hierarchical Clustering
Hierarchical clustering is a type of clustering that builds a hierarchy or tree (called a dendrogram) of clusters. There are two main approaches:

##### **Agglomerative (bottom-up)**:
1. Start with each data point as its own cluster.
2. Gradually merge the closest clusters until all data points are in a single cluster (or until a stopping criterion is met).

##### **Divisive (top-down)**:
1. Start with all data points in one cluster.
2. Recursively split the cluster into smaller clusters.

The core idea: Build a nested tree of clusters where the level of similarity is reflected in how close branches are.

It’s especially helpful when:
- You want a visual representation of data grouping (dendrogram).
- You don’t want to pre-specify the number of clusters (unlike k-means).

Intuition
- **Clustering** (in general) tries to find structure in data.
- **Hierarchical clustering** gives you multi-level grouping—you can decide how many clusters you want after seeing the hierarchy.

##### Clustering (K-Means)

If there are 3 distinct groups, K-means tries to:
- Randomly pick 3 centroids.
- Assign each point to the nearest centroid.
- Recalculate centroids based on assigned points.
- Repeat until things stabilize.

You end up with clearly separated blobs or clusters.

##### Hierarchical Clustering
- Each point = own cluster
- Then it starts merging closest pairs of clusters step-by-step, creating a tree
     - Level 1: A | B | C | D
     - Level 2: (A+B) | (C+D)
     - Level 3: (A+B+C+D)
- The result is a dendrogram (tree) that records the order in which clusters were merged.
- Cut the dendrogram at any height to choose how many clusters you want.
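
Cutting the tree can be sketched with SciPy's `fcluster`. The four 1-D points below are made-up stand-ins for A, B, C, and D, and single linkage is used only to keep the merge heights easy to follow.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four illustrative 1-D points standing in for A, B, C, D
points = np.array([[1.0], [1.5], [8.0], [8.4]])

Z = linkage(points, method='single')

# Cut so that at most 2 clusters remain
labels_two = fcluster(Z, t=2, criterion='maxclust')
# Cut at a height above the final merge so that everything is one cluster
labels_one = fcluster(Z, t=10.0, criterion='distance')

print("Two clusters:", labels_two)
print("One cluster:", labels_one)
```

The same linkage matrix serves both cuts, which is why the number of clusters can be chosen after the hierarchy has been built.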

Hierarchical clustering is a clustering technique that builds a multilevel hierarchy (or tree) of clusters by either merging smaller clusters or dividing a larger cluster into smaller ones. The result is a dendrogram — a tree-like diagram that shows how clusters are related.

📌 Primary Features of Hierarchical Clustering

1. Hierarchical Tree Structure (Dendrogram)
     - The output is a tree (dendrogram) showing how observations or clusters are merged or split at each level.
     - You can cut the tree at any level to get the desired number of clusters.

2. Two Main Types

🔹 Agglomerative (Bottom-Up)
- Starts with each data point as its own cluster
- Repeatedly merges the two closest clusters
- Most common approach

🔹 Divisive (Top-Down)
- Starts with all data points in one cluster
- Recursively splits clusters into smaller ones

3. Distance Metrics

Determines how similar or dissimilar observations or clusters are.

Common choices:
- Euclidean (default)
- Manhattan
- Cosine
- Correlation

4. Linkage Criteria

Controls how the distance between clusters is computed:
- Single Linkage – minimum distance between any two points in different clusters
- Complete Linkage – maximum distance between any two points in different clusters
- Average Linkage – average distance between all pairs of points in different clusters
- Ward’s Method – minimizes the increase in total within-cluster variance after merging

5. No Need to Pre-specify K
     - Unlike K-Means, you don't need to specify the number of clusters upfront.
     - You can choose the number of clusters by cutting the dendrogram at a given height.

6. Interpretability
     - The dendrogram gives a clear visual understanding of the nested structure in data.
     - Useful when you want to explore different levels of grouping.
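
The single, complete, and average linkage criteria listed under item 4 can be computed by hand for two small clusters. This is a sketch on made-up 1-D points; Ward's method is omitted because it is defined through variance rather than pairwise distances.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small illustrative clusters on a line
cluster_a = np.array([[0.0], [1.0]])
cluster_b = np.array([[3.0], [6.0]])

pairwise = cdist(cluster_a, cluster_b)  # all point-to-point distances across clusters

single_link = pairwise.min()     # closest pair
complete_link = pairwise.max()   # farthest pair
average_link = pairwise.mean()   # mean over all pairs

print("Single:", single_link)
print("Complete:", complete_link)
print("Average:", average_link)
```

The three criteria give different distances for the same pair of clusters, which is why the choice of linkage can change the merge order.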

When to Use Hierarchical Clustering

|Use Case	|Why It’s Suitable|
|---------|-----------------|
|Small to medium datasets	|Computation increases quickly (O(n²))|
|You want interpretability	|Dendrogram gives visual insight|
|You don't know how many clusters	|Can choose K by dendrogram cut|
|You expect nested/grouped structure	|Hierarchy reveals grouping levels|

Summary

|Feature	|Description|
|---------|-----------|
|Clustering type	|Hierarchical (tree-based), agglomerative or divisive|
|Output	|Dendrogram (tree diagram)|
|Distance metric	|Euclidean (default), others supported|
|Linkage methods	|Single, complete, average, Ward’s|
|K needed in advance?	|❌ No – select K by dendrogram cut|
|Interpretability	|✅ High – shows nested relationships|
|Scalability	|❌ Poor for large datasets (quadratic complexity)|

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Create sample data
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=42)

# --- K-Means Clustering ---
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# --- Hierarchical Clustering ---
hier = AgglomerativeClustering(n_clusters=3)
hier_labels = hier.fit_predict(X)

# --- Plotting ---
fig, axs = plt.subplots(1, 3, figsize=(18, 5))

# Original data
axs[0].scatter(X[:, 0], X[:, 1], c='gray')
axs[0].set_title("Original Data")

# K-means result
axs[1].scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis')
axs[1].scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
               s=200, c='red', marker='X', label='Centroids')
axs[1].set_title("K-Means Clustering")
axs[1].legend()

# Hierarchical Clustering result
axs[2].scatter(X[:, 0], X[:, 1], c=hier_labels, cmap='plasma')
axs[2].set_title("Hierarchical Clustering")

plt.show()

## Dendrogram for Hierarchical Clustering

# Create linkage matrix for dendrogram
linked = linkage(X, method='ward')

# Plot dendrogram
plt.figure(figsize=(12, 5))
dendrogram(linked, truncate_mode='lastp', p=20, leaf_rotation=45., leaf_font_size=12.)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Cluster Size")
plt.ylabel("Distance")
plt.show()

In [None]:
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

# Sample data
X, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# Perform hierarchical clustering
Z = linkage(X, method='ward')  # Try 'single', 'complete', etc.

# Plot dendrogram
plt.figure(figsize=(10, 5))
dendrogram(Z)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Sample Index")
plt.ylabel("Distance")
plt.show()


### Feature scaling is crucial in KMeans clustering
Understanding the difference between Z-score standardization and Min-Max scaling (normalization) helps you choose the right preprocessing for your data.

#### Why Feature Scaling is Important in KMeans?
KMeans is a distance-based algorithm (usually Euclidean distance):
$$ \text{Distance (x,y)} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + ...}$$

So, if features are on different scales, the feature with the largest range dominates the distance metric.

|Feature	|Range  |
|-----------|-------|
|Age	|18 to 70|
|Income ($)	|10k to 150k|
|Spending Score	|1 to 100 |

Without scaling, income will dominate distance calculations.

##### **Z-Score Standardization (StandardScaler)**
$$ z = \frac{(x - \mu)}{\sigma}$$

- Transforms features to have mean = 0 and std = 1
- Keeps distribution shape but puts features on comparable scale

✅ When to use:
- When features are normally distributed
- When outliers are not extreme

##### **Min-Max Scaling (MinMaxScaler)**
$$ x' = \frac{x - \text{min}}{\text{max} - \text{min}}$$

- Rescales values between 0 and 1
- Sensitive to outliers

✅ When to use:
- When you want to preserve distribution and scale into [0, 1]
- Often used in neural nets and visualizations

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import pandas as pd

# Example dataset
df = pd.DataFrame({
    'Age': [19, 35, 26, 45, 23],
    'Income': [15, 75, 45, 100, 32],
    'Spending': [39, 40, 77, 20, 6]
})

# Z-score standardization
z_scaled = StandardScaler().fit_transform(df)
df_z = pd.DataFrame(z_scaled, columns=df.columns)

# Min-max normalization
minmax_scaled = MinMaxScaler().fit_transform(df)
df_minmax = pd.DataFrame(minmax_scaled, columns=df.columns)

print("Z-Score Standardization:\n", df_z)
print("\nMin-Max Normalization:\n", df_minmax)


🧠 Interpretation (first row of the example dataset)

|Scaling	|Age	|Income	|Spending|
|-----------|-------|-------|--------|
|Original	|19	|15	|39  |
|Z-Score	|-1.14	|-1.26	|0.11  |
|Min-Max	|0.00	|0.00	|0.46   |

- Z-score scales relative to mean and variance (e.g., how many std devs a value is from the mean).
- Min-max scales based on range.

📌 In KMeans: Which to Use?

|Situation	|Recommended Scaling|
|-----------|-------------------|
|Data has different units/scales	|✅ Required|
|Data is normally distributed	|Z-score  |
|Data has fixed bounds (e.g., 0–1)	|Min-Max  |
|Outliers are present	|Z-score is safer   |

✅ Summary

|Scaling Method	|Preserves Distribution	|Bounded Output	|Sensitive to Outliers	|Centers at 0|
|---------------|-----------------------|---------------|-----------------------|------------|
|Z-score	|✅	|❌	|✅ (less than Min-Max)	|✅|
|Min-Max	|✅	|✅	|✅	|❌|

###  What Is Euclidean Distance?
Euclidean distance is the straight-line ("as-the-crow-flies") distance between two points in Euclidean space.

For two points:
- x = $(x_1, x_2,...,x_n)$ 
- y = $(y_1, y_2,...,y_n)$

The Euclidean distance d(x,y), which follows from the Pythagorean theorem, is:
$$ \text{Distance (x,y)} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_n - y_n)^2}$$
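
The formula translates directly into code. The sketch below uses a hypothetical helper, `euclidean_distance`, and the familiar 3-4-5 right triangle as a check.

```python
import numpy as np

def euclidean_distance(x, y):
    """Square root of the sum of squared coordinate differences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# 2-D: the classic 3-4-5 right triangle
d2 = euclidean_distance([0, 0], [3, 4])
# 3-D: same function, one more coordinate
d3 = euclidean_distance([1, 2, 3], [4, 6, 3])

print(d2, d3)
```

In practice `np.linalg.norm(x - y)` computes the same quantity.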

In KMeans Clustering
- Euclidean distance measures how close a data point is to each cluster centroid.
- Each centroid is the mean of all points in its cluster.
- KMeans iteratively assigns each point to its nearest centroid, based on Euclidean distance.

🚨 Why Scaling Matters?
- Without scaling, features with large numerical ranges dominate the distance calculation.

🔍 Example:

|Feature	|Age (years)	|Income ($)|
|-----------|---------------|----------|
|Person 1	|25	|30,000  |
|Person 2	|30	|70,000  |

The age difference is 5, but the income difference is 40,000, which would overwhelm the impact of age unless scaled.
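
A small sketch makes the effect concrete. Rows 0 and 1 are the two people from the table; rows 2 and 3 are made-up additions (assumptions) chosen so that the nearest neighbor of Person 1 depends on whether the features are scaled.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rows 0 and 1 are Person 1 and Person 2 from the table above.
# Rows 2 and 3 are made-up people added for illustration (assumptions).
people = np.array([
    [25, 30_000],   # Person 1
    [30, 70_000],   # Person 2
    [26, 32_000],   # similar age, slightly higher income
    [65, 30_500],   # much older, almost identical income
], dtype=float)

def nearest_to_first(data):
    """Index of the row closest to row 0 (excluding row 0 itself)."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1

raw_nearest = nearest_to_first(people)
scaled_nearest = nearest_to_first(StandardScaler().fit_transform(people))

print("Nearest to Person 1 (raw):    row", raw_nearest)
print("Nearest to Person 1 (scaled): row", scaled_nearest)
```

On the raw data the 65-year-old counts as the closest match because only income matters; after standardization the 26-year-old is closest, since age now contributes on a comparable scale.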

🔁 In Higher Dimensions (n-D)
- Euclidean distance generalizes to any number of features.
- It is still calculated as the square root of the sum of squared differences.
- In 3D: $\text{Distance (x,y)} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2}$

### **What Is K-Means Clustering?**

K-Means is an unsupervised learning algorithm used to group data into K clusters, where each data point belongs to the cluster with the nearest centroid (mean of points).

### Primary Features of K-Means
|Feature	| Description |
|-----------|-------------|
|💠 Centroid-Based	|Clusters are defined around central points (centroids)|
|🔁 Iterative Process	|Repeats until convergence (minimal movement of centroids)|
|📏 Distance-Based	|Commonly uses Euclidean distance to assign points|
|🔢 K Must Be Predefined	|You must specify the number of clusters (K) in advance|
|⚡ Efficient & Scalable	|Works well for large datasets with continuous features|

Intuition
- Imagine you’re grouping customers into K groups based on spending behavior — you want to find K group centers, then assign each customer to the closest center.

### General Approach to K-Means Clustering (Centroid-Based)

🎯 Goal:
- Partition the data into K clusters, where each cluster is defined by its centroid (mean). 
- The goal is to minimize the total within-cluster variance, also known as the sum of squared distances between data points and their assigned centroids.
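
Written out, this objective is commonly expressed as follows, where $C_k$ denotes the set of points assigned to cluster $k$ and $\mu_k$ its centroid (these symbols are notation introduced here for illustration):

$$ J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2, \qquad \mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i $$

The assignment step lowers $J$ by moving each point to its closest centroid, and the update step lowers $J$ by moving each centroid to the mean of its points, so $J$ never increases from one iteration to the next.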

✅ Step-by-Step Approach

1. Choose the number of clusters (K)
    - Decide how many groups (K) the data should be divided into.
    - This is often based on prior knowledge or on methods like the elbow method.
        - 🔹 Example: K = 3 (for three customer segments)

2. Initialize K centroids randomly
    - Randomly pick K data points from the dataset as the initial centroids (or use a smarter method like KMeans++ to improve the starting positions).
    - Each centroid represents the center of a cluster.

3. Assign each point to the nearest centroid
    - For each data point, compute the Euclidean distance to each centroid.
    - Assign the point to the cluster of the closest centroid.

4. Recalculate centroids
    - For each cluster, calculate the mean of all the points assigned to it.
    - Update the centroid position to this mean.

5. Repeat steps 3 and 4 until convergence
    - Continue assigning points and updating centroids until:
        - No change in assignments, or
        - Centroids don’t move significantly, or
        - A maximum number of iterations is reached.

Intuition Behind It
- Each cluster is trying to pull similar points toward itself.
- Centroids move to the center of mass of their assigned points.
- Eventually, a stable configuration (local minimum) is reached.

Optional Enhancements
|Enhancement	|Benefit |
|---------------|--------|
|KMeans++ Init	|Smarter starting centroids = faster, better convergence|
|Scaling Features	|Prevents domination by large-scale features|
|Elbow Method	|Helps choose the best K|
|Silhouette Score	|Measures how well points fit within clusters|
|Cluster Explainability (SHAP)	|Interprets clusters using supervised models|
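
Two of these enhancements, KMeans++ initialization and the elbow method, can be sketched together with scikit-learn. The data is synthetic, and the range of K values tried is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups (illustrative only)
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=42)

inertias = {}
for k in range(1, 7):
    # init='k-means++' spreads out the starting centroids;
    # n_init=10 reruns the algorithm and keeps the best result
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    model.fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"K={k}: WCSS={inertia:.1f}")
```

The elbow is the value of K after which the WCSS stops dropping sharply; for data generated from three blobs one would expect that bend near K = 3, though it is worth plotting the curve rather than relying on the printed numbers alone.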

### Pros and Cons

Pros:
- Simple and fast for large datasets
- Easy to interpret
- Works well when clusters are spherical and equally sized

Cons:
- Requires pre-specifying K
- Sensitive to initial centroids
- Can get stuck in local minima
- Not ideal for non-convex or differently sized/density clusters

The following Python example shows how centroids move over iterations in K-Means clustering, using a matplotlib animation to visualize how K-Means converges.

What This Code Does:
- Uses make_blobs to create 2D data.
- Applies K-Means step-by-step manually.
- Animates how centroids move and how points switch clusters.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from sklearn.datasets import make_blobs

# Step 1: Generate synthetic data
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=42)

# Step 2: Initialize parameters
k = 3
np.random.seed(42)
initial_centroids = X[np.random.choice(X.shape[0], k, replace=False)]
centroids = initial_centroids.copy()
colors = ['red', 'green', 'blue']
history_centroids = [centroids.copy()]
history_labels = []

# Helper function to assign clusters
def assign_clusters(X, centroids):
    distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
    return np.argmin(distances, axis=1)

# K-means loop
for _ in range(10):  # Run for fixed 10 iterations
    labels = assign_clusters(X, centroids)
    history_labels.append(labels.copy())
    
    new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    if np.allclose(centroids, new_centroids):
        break  # Convergence
    centroids = new_centroids
    history_centroids.append(centroids.copy())

# Step 3: Animate the centroid movement
fig, ax = plt.subplots(figsize=(8, 6))

def update(frame):
    ax.clear()
    centroids = history_centroids[frame]
    labels = history_labels[frame if frame < len(history_labels) else -1]
    
    for i in range(k):
        points = X[labels == i]
        ax.scatter(points[:, 0], points[:, 1], s=30, color=colors[i], label=f'Cluster {i+1}')
        ax.scatter(*centroids[i], color='black', s=200, marker='X', edgecolor='white', linewidth=2)
    
    ax.set_title(f'Iteration {frame + 1}')
    ax.legend()

ani = FuncAnimation(fig, update, frames=len(history_centroids), interval=1000, repeat=False)
plt.show()

### Step-by-Step Walkthrough of the K-Means Logic (Manual Iterations)

1. Generate Synthetic Data
    - We create a dataset X with 200 points and 3 natural clusters.
    - make_blobs is perfect for this kind of demo.

In [None]:
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=1.0, random_state=42)

2. Initialize Parameters and Random Centroids
- k = 3 means we want to find 3 clusters.
- We randomly choose 3 data points from X to act as the starting centroids.
- These are the "black X" markers in the animation’s first frame.

In [None]:
k = 3
initial_centroids = X[np.random.choice(X.shape[0], k, replace=False)]
centroids = initial_centroids.copy()

3. Begin K-Means Iterations
- We perform these steps manually, up to 10 times (can stop earlier if convergence happens).

In [None]:
for _ in range(10):
    labels = assign_clusters(X, centroids)
    ...

3a. Assign Each Point to the Nearest Centroid
    - For every point in X, compute the distance to each centroid.
    - Assign each point to the nearest centroid.
    - This is when colors of the points change in the animation.

In [None]:
def assign_clusters(X, centroids):
    distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
    return np.argmin(distances, axis=1)
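
The broadcasting inside `assign_clusters` is easier to follow on a tiny input. This is a small sketch with made-up arrays (`X_small` and `centroids_small` are hypothetical and not part of the example above) that prints the intermediate shapes.

```python
import numpy as np

# Hypothetical toy inputs: 3 points, 2 centroids, 2 features
X_small = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
centroids_small = np.array([[0.0, 0.0], [6.0, 8.0]])

diff = X_small[:, np.newaxis] - centroids_small   # shape (3, 2, 2): point x centroid x feature
distances = np.linalg.norm(diff, axis=2)          # shape (3, 2): one distance per (point, centroid)
labels_small = np.argmin(distances, axis=1)       # index of the nearest centroid for each point

print(diff.shape, distances.shape)   # (3, 2, 2) (3, 2)
print(distances)                     # rows: [0, 10], [5, 5], [10, 0]
print(labels_small)                  # [0 0 1]
```

The middle point is equally far from both centroids. `np.argmin` returns the first minimum, so ties always go to the lower-numbered cluster.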

3b. Recompute the Centroids
    - For each cluster:
        - Take the mean of all points currently assigned to it.
        - This new mean becomes the new centroid (center of gravity).
    - This is when the X markers jump in the animation.

In [None]:
new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
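
One edge case is worth knowing: if a cluster ends up with no points, `X[labels == i].mean(axis=0)` produces NaN values. A possible guard, sketched here under the assumption that keeping the previous centroid is acceptable (the helper name `update_centroids_safe` and the demo arrays are hypothetical), looks like this:

```python
import numpy as np

def update_centroids_safe(X, labels, centroids):
    """Recompute each centroid as the mean of its points; keep the previous
    centroid when a cluster has no points (the mean would be NaN)."""
    new_centroids = centroids.astype(float).copy()
    for j in range(len(centroids)):
        members = X[labels == j]
        if len(members) > 0:
            new_centroids[j] = members.mean(axis=0)
    return new_centroids

# Hypothetical example: cluster 2 receives no points
X_demo = np.array([[1.0, 1.0], [3.0, 1.0], [10.0, 10.0]])
labels_demo = np.array([0, 0, 1])
old_centroids = np.array([[0.0, 0.0], [9.0, 9.0], [50.0, 50.0]])

result = update_centroids_safe(X_demo, labels_demo, old_centroids)
print(result)   # rows: [2, 1], [10, 10], [50, 50]
```

Keeping the old centroid is only one option. Re-seeding the empty centroid at a point far from its current centroid is another common choice.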

3c. Check for Convergence
    - If the centroids don't move (or barely move), the algorithm stops early.
    - In the animation, this is the last frame, where points and centroids stabilize.

In [None]:
if np.allclose(centroids, new_centroids):
    break  # Converged

4. Animation Setup
    - Each frame in the animation does the following:
        - Points are colored based on cluster assignment.
        - Centroids are marked with a large black X.
        - You can literally watch the centroids move as the model reassigns and recalculates.

In [None]:
def update(frame):
    ...
    # Plot points by cluster label
    ax.scatter(points[:, 0], points[:, 1], color=colors[i])
    
    # Plot centroids as 'X' markers
    ax.scatter(*centroids[i], color='black', marker='X', s=200)

##### Manual implementation of the K-Means algorithm in Python, including:
🧹 Data setup

📏 Manual distance calculation

🎯 Assigning clusters

♻️ Updating centroids

🔁 Iterating

📊 Visualizing

In [None]:
# Step 1: Setup
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For reproducibility
np.random.seed(42)

# Create some simple 2D data
X = np.array([
    [1, 2], [1, 4], [1, 0],
    [4, 2], [4, 4], [4, 0]
])

plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], s=100)
plt.title("Original Data Points")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.grid(True)
plt.show()

# Step 2: Initialize K and Centroids
K = 2
# Initialize two centroids by picking two data points directly
# (fixed rather than random, so the walkthrough is reproducible)
centroids = np.array([X[0], X[3]])  # [1, 2] and [4, 2]

def plot_clusters(X, centroids, labels=None, title=""):
    plt.figure(figsize=(6, 5))
    if labels is None:
        plt.scatter(X[:, 0], X[:, 1], s=100)
    else:
        sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=labels, palette='Set2', s=100, legend=False)
    plt.scatter(centroids[:, 0], centroids[:, 1], c='black', s=200, marker='X', label='Centroids')
    plt.title(title)
    plt.grid(True)
    plt.show()

plot_clusters(X, centroids, title="Initial Centroids")

# Step 3: Assign Points to Nearest Centroid
def assign_clusters(X, centroids):
    distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
    return np.argmin(distances, axis=1)

labels = assign_clusters(X, centroids)
plot_clusters(X, centroids, labels, title="Assigned Clusters (Iteration 1)")


# Step 4: Update Centroids
def update_centroids(X, labels, K):
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return new_centroids

new_centroids = update_centroids(X, labels, K)
print("New Centroids:\n", new_centroids)


# Step 5: Iterate Until Convergence
tolerance = 1e-4
max_iters = 10

for i in range(max_iters):
    labels = assign_clusters(X, centroids)
    new_centroids = update_centroids(X, labels, K)
    
    plot_clusters(X, new_centroids, labels, title=f"Iteration {i+1}")
    
    # Check for convergence
    if np.allclose(centroids, new_centroids, atol=tolerance):
        print(f"Converged in {i+1} iterations")
        break
    centroids = new_centroids


### How to calculate and interpret the centroid of a cluster in KMeans

**What Is a Centroid in KMeans?**

In KMeans clustering, a centroid is the center (or average) of all points assigned to a cluster.
$$ \text{Centroid} = \frac{1}{n} \sum^n_{i = 1} x_i $$

Where:
- n = number of points in the cluster
- $x_i$ = each point (a vector of features)

Intuition:
Think of the centroid as the "prototype customer" or "average profile" of the people in a cluster.

For example, if you clustered customers based on Age and Annual Income, the centroid might be:

|Feature	|Cluster 1 Centroid|
|-----------|------------------|
|Age	|28.7 years  |
|Income ($)	 |45,500   |

This represents the average customer profile in Cluster 1.

🧪 Example Calculation (Step-by-Step)
Suppose we have the following cluster with 3 customers:
|Customer 	|Age	 |Income (in $1k)|
|-----------|--------|---------------|
|A	|25	|40|
|B	|30	|50|
|C	|35	|45|

➤ Step 1: Add Up Each Feature
- Age: $25+30+35=90$
- Income: $40+50+45=135$

➤ Step 2: Divide by Number of Points (3)
- Centroid Age: $\frac{90}{3} = 30$
- Centroid Income: $\frac{135}{3} = 45$

Cluster Centroid:
$$\text{Centroid}=(30,45)$$

Interpretation in Clustering
- Centroids are not necessarily actual data points.
- They summarize the location of the cluster in feature space.
- KMeans uses centroids to:
    - Reassign points (based on nearest centroid)
    - Update centroid after each iteration

In [None]:
import numpy as np
import pandas as pd

# Cluster data
data = np.array([[25, 40],
                 [30, 50],
                 [35, 45]])

# Compute centroid
centroid = np.mean(data, axis=0)
print("Centroid:", centroid)


### Interpreting the role of Supervised Learning and interpreting Cluster analysis results
This section touches on the intersection between unsupervised and supervised learning.

Goal of Cluster Analysis
- Cluster analysis is an unsupervised learning method:
    - It finds natural groupings in data without using labels.
    - It’s exploratory and used for understanding structure, segmenting customers, etc.

But here's the catch:
➡️ Clusters by themselves don't tell us why they exist or what they mean.

### Role of Supervised Learning in Interpreting Clusters
Once clustering is done, supervised learning helps in interpreting and validating those clusters by:

1. Predicting Cluster Membership
    - You treat the cluster labels (from K-Means, Hierarchical, etc.) as pseudo-labels (targets) and train a classifier (e.g., Decision Tree, Random Forest, XGBoost) to:
        - Understand what features explain cluster membership.
        - Determine rules that define each cluster.

This turns the problem into supervised classification:
- Input: Features (X)  
- Target: Cluster label (from unsupervised model)

2. Feature Importance
    - Supervised models can quantify feature importance:
        - Helps interpret which features distinguish clusters.
        - Especially useful when clusters are not easily visualized (high-dimensional data).
    - Example: A Random Forest might reveal that:
        - Cluster A = High income + urban
        - Cluster B = Low income + rural

3. Naming and Profiling Clusters
    - Once features are understood, supervised learning helps:
        - Name clusters (“Budget Shoppers”, “Luxury Travelers”, etc.)
        - Design strategies: marketing, policy targeting, etc.

🧠 Real-World Analogy
- Clustering: You group customers into 4 segments based on behavior
- Supervised Learning: You now want to know what makes each segment different → use decision trees or XGBoost to model this.

In [None]:
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
import pandas as pd

# Step 1: Generate or load data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.2, random_state=42)

# Step 2: Cluster using K-Means
kmeans = KMeans(n_clusters=4, random_state=42)
cluster_labels = kmeans.fit_predict(X)

# Step 3: Train a supervised model using cluster labels
clf = RandomForestClassifier()
clf.fit(X, cluster_labels)

# Step 4: Feature importances
importances = clf.feature_importances_
print("Feature importances to explain clusters:", importances)
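
Feature importances say which features matter, but not how they separate the clusters. To get the "rules that define each cluster" mentioned earlier, one option is a shallow decision tree printed with scikit-learn's `export_text`. This is a sketch under assumptions: the names `X_demo`, `cluster_ids`, and the feature labels `feature_1` / `feature_2` are made up, and `max_depth=3` is an arbitrary choice that trades accuracy for readability.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

# Same kind of synthetic data as above; the feature names are hypothetical
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.2, random_state=42)
cluster_ids = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_demo)

# A shallow tree gives rules a person can read
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_demo, cluster_ids)

rules = export_text(tree, feature_names=["feature_1", "feature_2"])
agreement = tree.score(X_demo, cluster_ids)

print(rules)
print("Agreement with K-Means labels:", agreement)
```

The agreement score shows how much of the clustering the simple rules reproduce. A low value means the clusters are not well described by a few axis-aligned thresholds.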

When is this useful?
- Customer segmentation → Understand traits of each group
- Fraud detection → Explain why certain transactions fall in an "anomalous" cluster
- Healthcare → Group patients, then explain what factors drive group differences

Bonus Tip:
- You can also combine clustering + supervised learning in semi-supervised learning setups or in active learning pipelines — great when you have partial labels.

##### Full example using customer demographics, showing how to:
- Cluster the customers (unsupervised)
- Train a supervised model to predict cluster membership
- Use SHAP to interpret how features explain each cluster

We’ll use the Mall Customers dataset (`Mall_Customers.csv`), which includes:
- Age
- Annual Income
- Spending Score

Step 1: Load and Inspect the Dataset

In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv("Mall_Customers.csv")
df.head()

We’ll use numeric features: Age, Annual Income, Spending Score.

Step 2: Cluster Customers Using KMeans

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Select features
X = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]

# Scale data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# KMeans clustering
kmeans = KMeans(n_clusters=4, random_state=42)
cluster_labels = kmeans.fit_predict(X_scaled)

# Add cluster labels to original DataFrame
df['Cluster'] = cluster_labels


Step 3: Train a Supervised Model to Predict Cluster

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Use original features (not scaled) for interpretability
X = df[['Age', 'Annual Income (k$)', 'Spending Score (1-100)']]
y = df['Cluster']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train Random Forest
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

Step 4: Interpret with SHAP

In [None]:
import shap

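# Create SHAP explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Visualize summary for all classes
shap.summary_plot(shap_values, X_test, plot_type="bar")

# Drill down to a single cluster.
# Depending on the installed SHAP version, shap_values is either a list with one
# array per class or a single 3-D array (samples, features, classes).
cluster_index = 1
if isinstance(shap_values, list):
    cluster_shap = shap_values[cluster_index]
else:
    cluster_shap = shap_values[:, :, cluster_index]
shap.summary_plot(cluster_shap, X_test)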


What Does SHAP Tell You?
- Which features (age, income, score) are driving assignment to each cluster
- Helps name clusters, for example (hypothetical outcomes; the actual profiles depend on the fitted clusters):
    - Cluster 0 = High Income + High Spend
    - Cluster 2 = Young + Low Spend
- Enables transparent business actions (target marketing, loyalty programs)
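
A simple complement to SHAP when naming clusters is a per-cluster profile table: the mean of every feature within each cluster. This is a minimal pandas sketch on made-up rows (the `customers` frame and its values are hypothetical, with column names chosen to mirror the dataset above).

```python
import pandas as pd

# Hypothetical customers with cluster labels already assigned
customers = pd.DataFrame({
    "Age": [22, 25, 47, 52, 31, 35],
    "Annual Income (k$)": [20, 24, 80, 90, 75, 85],
    "Spending Score (1-100)": [75, 80, 15, 20, 85, 90],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

# Mean of every feature per cluster, plus the cluster size
profile = customers.groupby("Cluster").mean()
profile["Size"] = customers.groupby("Cluster").size()
print(profile)
```

In this toy table, cluster 1 has the highest mean age and income but the lowest spending score, while cluster 2 combines high income with high spending. Reading the rows this way is what turns a cluster number into a name.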

## Cluster Validation metrics
- WCSS (Inertia)
- Silhouette Score
- Davies–Bouldin Index (DBI)
- Calinski–Harabasz Index (CHI)

### Tracking WCSS (Within-Cluster Sum of Squares) 
Tracking WCSS across iterations is a simple way to monitor how the clustering improves.

### **What Is Inertia in K-Means?**
Inertia is the Within-Cluster Sum of Squares (WCSS) — a key internal validation metric that measures how tightly grouped the data points in each cluster are.

Definition:
Inertia, also called WCSS, is the sum of squared distances between each data point and the centroid of its assigned cluster. The two terms are used interchangeably in this section.

Formula

Summed over all data points:
$$ \text{Inertia} = \sum^n_{i=1} || x_i - \mu_{c(i)}||^2 $$
Where:
- $x_i$ : a data point
- $\mu_{c(i)}$ : the centroid of the cluster assigned to $x_i$
- $\|\cdot\|$ : Euclidean norm (distance)
- Sum is over all data points
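
As a worked instance of the formula, the six-point toy dataset used in the manual K-Means example can be evaluated by hand. This sketch assumes the left three points form one cluster and the right three the other (`labels_demo` is that assumed assignment).

```python
import numpy as np

X_demo = np.array([[1, 2], [1, 4], [1, 0],
                   [4, 2], [4, 4], [4, 0]], dtype=float)
labels_demo = np.array([0, 0, 0, 1, 1, 1])   # assumed assignment
centroids_demo = np.array([X_demo[labels_demo == j].mean(axis=0) for j in range(2)])

# Squared Euclidean distance from each point to the centroid of its own cluster
sq_dists = np.sum((X_demo - centroids_demo[labels_demo]) ** 2, axis=1)
inertia = sq_dists.sum()

print(centroids_demo)   # rows: [1, 2] and [4, 2]
print(sq_dists)         # [0. 4. 4. 0. 4. 4.]
print(inertia)          # 16.0
```

Each cluster contributes 0 + 4 + 4 = 8, so the inertia is 16.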

It measures how tight the clusters are:
- Lower WCSS → tighter clusters (points are closer to their centroids). Compare WCSS only between runs with the same K, because it shrinks as K grows.

Why Is Inertia Important?
|Purpose	|Explanation |
|-----------|------------|
|✅ Measures cluster compactness 	|Lower inertia = tighter clusters|
|🔍 Helps evaluate K (number of clusters)	|Useful in the elbow method|
|🔧 Optimization metric in K-Means 	|The algorithm minimizes inertia during training |

How to Track WCSS in Each Iteration?
- Let’s add this logic into the K-Means loop:

In [None]:
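# Assumes X, k, initial_centroids and assign_clusters from the animation example above.
# Re-run that example's data-generation cell first if X has been overwritten since.
centroids = initial_centroids.copy()  # restart from the initial centroids
history_centroids = [centroids.copy()]
history_labels = []
wcss_history = []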

for _ in range(10):
    labels = assign_clusters(X, centroids)
    history_labels.append(labels.copy())

    # Compute WCSS for this iteration
    wcss = sum(np.sum((X[labels == i] - centroids[i])**2) for i in range(k))
    wcss_history.append(wcss)

    new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    if np.allclose(centroids, new_centroids):
        break
    centroids = new_centroids
    history_centroids.append(centroids.copy())


Plot WCSS Over Iterations
- After running the above loop, you can visualize how clustering improves:

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
plt.plot(range(1, len(wcss_history)+1), wcss_history, marker='o', linestyle='--', color='purple')
plt.title('WCSS over Iterations')
plt.xlabel('Iteration')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()


##### When and Why to Use Inertia
✅ Use Cases:
- Evaluating model quality during unsupervised learning
- Choosing the optimal number of clusters (K) using the elbow method
- Comparing different clustering results

Use it to track performance across different values of K.

🚫 Caveats:
- Inertia keeps decreasing as K increases (more clusters = smaller groups), so it cannot determine the best K on its own. Combine it with the silhouette score or the Davies–Bouldin index.
- It’s sensitive to feature scaling, so standardize features first (z-score or MinMax scaling).

##### Python Example: Calculate Inertia

In [None]:
import numpy as np
from sklearn.cluster import KMeans

# Use the same 2D data
X = np.array([
    [1, 2], [1, 4], [1, 0],
    [4, 2], [4, 4], [4, 0]
])

kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

# Inertia (WCSS)
print("Inertia (WCSS):", kmeans.inertia_)

### Elbow Method
The elbow method is a widely used heuristic for choosing the number of clusters (K) in K-Means.

### What Is the Elbow Method?
The Elbow Method helps you determine the ideal number of clusters (K) by plotting:
$$ \text{Number of clusters (K) vs. Inertia (WCSS)}$$
- Inertia / WCSS = Sum of squared distances from each point to its assigned cluster centroid.

The idea is to find the "elbow" point — where adding more clusters doesn’t significantly reduce inertia anymore.

When plotted, the graph typically shows:
- A steep drop in inertia as K increases
- A bend or "elbow" point
- After the elbow, improvements in clustering diminish

The Elbow Method involves:
- Running K-Means for different values of K (e.g., from 1 to 10).
- Recording the inertia (WCSS) for each value of K.
- Plotting K vs WCSS.
- Finding the "elbow" point, where WCSS starts to flatten, which indicates the best trade-off between cluster compactness and model simplicity.
- Choosing that K as the number of clusters.

Why It Works:
- For small K, WCSS drops rapidly as clusters better explain the data.
    - A lower inertia suggests that points are closer to their centroids.
- After a certain K, adding more clusters gives diminishing returns.
- The elbow shows where to stop.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Step 1: Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Step 2: Run KMeans for different values of K
wcss = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, max_iter=300, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS

# Step 3: Plot the Elbow Curve
plt.figure(figsize=(8, 5))
plt.plot(K_range, wcss, 'bo--', linewidth=2, markersize=8)
plt.title('Elbow Method to Determine Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS (Within-Cluster Sum of Squares)')
plt.xticks(K_range)
plt.grid(True)
plt.show()

In [None]:
inertias = []
K_range = range(1, 10)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method')
plt.grid(True)
plt.show()

# Look for the value of K where inertia drops sharply and then levels off; that is often a good choice for K.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Sample 2D data
X = np.array([
    [1, 2], [1, 4], [1, 0],
    [4, 2], [4, 4], [4, 0]
])

# Track inertia for different K
# (K cannot exceed the number of samples, which is 6 here, or KMeans raises an error)
inertias = []
K_range = range(1, len(X) + 1)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(8, 5))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters K')
plt.ylabel('Inertia (WCSS)')
plt.title('Elbow Method for Optimal K')
plt.grid(True)
plt.show()

How to Interpret the Plot:
- Look for the "elbow" or "knee" in the curve — the point where the rate of decrease sharply slows.
- That K is a good balance of model simplicity and clustering quality.

Example:
- If your WCSS looks like:

|K	| WCSS  |
|---|-------|
|1	| 1700  |
|2	|900    |
|3	|500    |
|4	|250    |
|5	|230    |
|6	|220    |

You might notice a strong bend (elbow) at K = 3 or 4 → those are good candidates.
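
The bend can be made concrete by computing how much WCSS falls with each extra cluster. This short sketch uses only the values from the table above; the variable names are illustrative.

```python
import numpy as np

k_values = np.array([1, 2, 3, 4, 5, 6])
wcss_values = np.array([1700, 900, 500, 250, 230, 220])

drops = -np.diff(wcss_values)          # absolute reduction when moving from K-1 to K
relative = drops / wcss_values[:-1]    # reduction as a fraction of the previous WCSS

for k, d, r in zip(k_values[1:], drops, relative):
    print(f"K={k}: WCSS drops by {d} ({r:.0%} of the previous value)")
```

The reduction is 250 when going to K = 4 but only 20 when going to K = 5, so the curve flattens right after K = 4.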

##### When Is the Elbow Point Clear?
- A clear bend in the curve makes the elbow obvious.

- If the plot is smooth and gradual, consider:
    - Silhouette score
    - Davies-Bouldin index
    - Calinski-Harabasz index
    - KneeLocator (to automate elbow detection)

### Automate Elbow Detection using KneeLocator

In [None]:
!pip install kneed

In [None]:
from kneed import KneeLocator
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Step 1: Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Step 2: Compute WCSS for K = 1 to 10
wcss = []
K_range = range(1, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, max_iter=300, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Step 3: Detect the 'elbow' automatically
knee = KneeLocator(K_range, wcss, curve='convex', direction='decreasing')
print(f"Elbow point (optimal K): {knee.elbow}")

# Step 4: Plot with elbow
plt.figure(figsize=(8, 5))
plt.plot(K_range, wcss, 'bo--')
plt.axvline(x=knee.elbow, color='red', linestyle='--', label=f"Elbow at K={knee.elbow}")
plt.xlabel('Number of Clusters (K)')
plt.ylabel('WCSS')
plt.title('Elbow Method with KneeLocator')
plt.legend()
plt.grid(True)
plt.show()

This will output the optimal K and visually mark it with a red dashed line on the elbow plot.

|Aspect |Inertia (WCSS)|
|-------|-----------|
|Measures |Total within-cluster sum of squares|
|Goal |Minimize it for compact, well-defined clusters|
|Use |Cluster evaluation, elbow method|
|Limit |Always decreases with K; not ideal alone|

### Silhouette Score Comparison

What is Silhouette Score?

The Silhouette Score measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).

It answers:
- "How well is each point placed within its cluster?"

Formula
For each point i:

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
- Where:
    - a(i) = mean distance between i and the other points in the same cluster (intra-cluster distance)
    - b(i) = mean distance between i and the points in the nearest other cluster (nearest-cluster distance)

The Silhouette Score measures how well each data point fits within its assigned cluster:
- Range: -1 to 1
    - +1: Point is well-matched to its own cluster and poorly matched to others.
    - 0: Point is on or very close to the boundary between two clusters.
    - −1: Point is likely assigned to the wrong cluster.
- Closer to 1 → better clustering.
- Use it to validate or complement elbow method.
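
To see the formula at work, s(i) can be computed directly from its definition. The sketch below is an unoptimized illustration that assumes Euclidean distance and at least two clusters; the helper `silhouette_values` and the toy arrays are hypothetical, and in practice `sklearn.metrics.silhouette_score` does this for you.

```python
import numpy as np

def silhouette_values(X, labels):
    """Per-point silhouette s(i) from the definition, using Euclidean distance."""
    dists = np.linalg.norm(X[:, np.newaxis] - X, axis=2)   # all pairwise distances
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue   # a point alone in its cluster gets s(i) = 0
        # a(i): mean distance to the other members (the self-distance is 0)
        a = dists[i, same].sum() / (same.sum() - 1)
        # b(i): smallest mean distance to the points of any other cluster
        b = min(dists[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X_demo = np.array([[1, 2], [1, 4], [1, 0],
                   [4, 2], [4, 4], [4, 0]], dtype=float)
labels_demo = np.array([0, 0, 0, 1, 1, 1])

s = silhouette_values(X_demo, labels_demo)
print(np.round(s, 3))
print("Average silhouette:", round(s.mean(), 3))
```

For the point (1, 2), a = 2 (its two neighbours are each 2 away) and b is the mean distance to the three points of the other cluster, so s is positive. The average over all points is the number reported as the silhouette score.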

How to Use the Silhouette Method to Choose K
- Run clustering for a range of K values
- Compute average silhouette score for all points
- Plot K vs average silhouette score
- Select the K with the highest silhouette score.

🧪 Code to Compute Silhouette Scores for Multiple K:

In [None]:
from sklearn.metrics import silhouette_score

silhouette_scores = []

# Skip K=1 because silhouette score is undefined for a single cluster
K_range_silhouette = range(2, 11)

for k in K_range_silhouette:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, max_iter=300, random_state=42)
    cluster_labels = kmeans.fit_predict(X)
    score = silhouette_score(X, cluster_labels)
    silhouette_scores.append(score)

# Plot Silhouette Scores
plt.figure(figsize=(8, 5))
plt.plot(K_range_silhouette, silhouette_scores, 'go--')
plt.title('Silhouette Score for Different K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.xticks(K_range_silhouette)
plt.grid(True)
plt.show()

# Best K according to silhouette score
best_k_silhouette = K_range_silhouette[np.argmax(silhouette_scores)]
print(f"Best K by Silhouette Score: {best_k_silhouette}")

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# Simulated data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

silhouette_scores = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    score = silhouette_score(X, kmeans.labels_)
    silhouette_scores.append(score)
    print(f"K={k} => Silhouette Score: {score:.4f}")

# Plot
plt.figure(figsize=(8, 5))
plt.plot(K_range, silhouette_scores, 'bo-')
plt.xlabel('Number of Clusters K')
plt.ylabel('Average Silhouette Score')
plt.title('Silhouette Method for Optimal K')
plt.grid(True)
plt.show()


Interpretation Tips:

|Method	|What to Look For |
|-------|-----------------|
|Elbow	|Look for a knee in WCSS drop |
|Silhouette	 |Look for the peak (maximum score) |

How to Interpret the Plot
- Peak silhouette score = Best K
- A drop in the score after the peak means adding more clusters reduces overall quality
- As a rough rule of thumb, an average score above 0.5 indicates good clustering
- A score below 0.25 indicates weak or poorly separated clusters

You may sometimes get slightly different results — use domain knowledge + business context to finalize.

|Feature	|Silhouette Method |
|-----------|------------------|
|🎯 Purpose	|Evaluate clustering quality and choose K|
|📐 Metric	|Silhouette Score (from -1 to +1)|
|🧠 Best K	|Value with maximum average silhouette score|
|📉 Advantage	|Balances cohesion and separation|
|⚠️ Limitation	|May not perform well with overlapping clusters|

### **Davies–Bouldin Index (DBI)**
The Davies–Bouldin Index is a useful but less widely known internal validation metric.

Cluster Validation Metrics Overview

|Metric |Range |Goal |Interpretation |
|-------|------|-----|---------------|
|WCSS |0 → ∞ |Minimize |Measures total squared distance to centroids|
|Silhouette Score |-1 → 1 |Maximize |Measures separation and cohesion|
|Davies–Bouldin Index |0 → ∞ |Minimize |Ratio of intra-cluster scatter to inter-cluster separation|

### What is the Davies–Bouldin Index (DBI)?
The DBI compares similarity between clusters, where similarity is a function of:
- Intra-cluster scatter (how tight each cluster is)
- Inter-cluster separation (how far clusters are apart)

✅ Lower DBI means better clustering (tight, well-separated clusters)
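
Written out, with $s_i$ the mean distance of cluster $i$'s points to its centroid and $d_{ij}$ the distance between the centroids of clusters $i$ and $j$:

$$ \text{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}} $$

The sketch below computes this by hand with Euclidean distances. The helper `davies_bouldin` and the toy arrays are hypothetical names for illustration; `sklearn.metrics.davies_bouldin_score` is the function to use in practice.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index with Euclidean distances (lower is better)."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Scatter s_i: mean distance of each cluster's points to its centroid
    scatter = np.array([
        np.linalg.norm(X[labels == c] - centroids[idx], axis=1).mean()
        for idx, c in enumerate(clusters)
    ])
    k = len(clusters)
    worst_ratios = []
    for i in range(k):
        ratios = [
            (scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(k) if j != i
        ]
        worst_ratios.append(max(ratios))   # the most similar (worst) neighbour of cluster i
    return float(np.mean(worst_ratios))

X_demo = np.array([[1, 2], [1, 4], [1, 0],
                   [4, 2], [4, 4], [4, 0]], dtype=float)
labels_demo = np.array([0, 0, 0, 1, 1, 1])

dbi = davies_bouldin(X_demo, labels_demo)
print(round(dbi, 3))   # 0.889
```

Here each cluster has scatter 4/3 and the centroids are 3 apart, so the index is (4/3 + 4/3) / 3 = 8/9.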

Compute WCSS, Silhouette Score, and DBI all together for different K:

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score
import matplotlib.pyplot as plt
import numpy as np

# Step 1: Create data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Step 2: Prepare storage
K_range = range(2, 11)
wcss = []
silhouette_scores = []
dbi_scores = []

# Step 3: Loop through different K values
for k in K_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, max_iter=300, random_state=42)
    labels = kmeans.fit_predict(X)

    # Metrics
    wcss.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X, labels))
    dbi_scores.append(davies_bouldin_score(X, labels))

# Step 4: Plot all three metrics
fig, axs = plt.subplots(3, 1, figsize=(8, 12), sharex=True)

axs[0].plot(K_range, wcss, 'bo--')
axs[0].set_title("WCSS (Within-Cluster Sum of Squares)")
axs[0].set_ylabel("WCSS")
axs[0].grid(True)

axs[1].plot(K_range, silhouette_scores, 'go--')
axs[1].set_title("Silhouette Score")
axs[1].set_ylabel("Silhouette")
axs[1].grid(True)

axs[2].plot(K_range, dbi_scores, 'ro--')
axs[2].set_title("Davies–Bouldin Index")
axs[2].set_ylabel("DBI (Lower is Better)")
axs[2].set_xlabel("Number of Clusters (K)")
axs[2].grid(True)

plt.tight_layout()
plt.show()


How to Use These Together:
- WCSS tells you how compact clusters are.
- Silhouette Score tells you how distinct the clusters are.
- DBI tells you how well-separated + compact clusters are (combined signal).

|K	|WCSS	|Silhouette	|DBI  |
|---|-------|-----------|-----|
|2	|1500	|0.62	 |0.9  |
|3	|1000	|0.67	|0.75  |
|4	|600	|0.71	|0.53   |
|5	|520	|0.65	|0.60   |

✅ Best K: 4, based on peak Silhouette + lowest DBI.

### Calinski–Harabasz Index (CHI) 
The Calinski–Harabasz Index is another solid metric for evaluating cluster quality.

What is the Calinski–Harabasz Index?
- Also called the Variance Ratio Criterion, it evaluates clustering by comparing:
    - Between-cluster dispersion (how far clusters are from each other)
    - Within-cluster dispersion (how tight points are within a cluster)

✅ Higher CHI is better — it means dense, well-separated clusters.
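
With $B$ the between-cluster dispersion, $W$ the within-cluster dispersion (the WCSS), $n$ the number of points and $k$ the number of clusters:

$$ \text{CHI} = \frac{B / (k - 1)}{W / (n - k)} $$

A hand computation, sketched with hypothetical names (`calinski_harabasz`, `X_demo`, `labels_demo`) and intended only to make the ratio concrete; `sklearn.metrics.calinski_harabasz_score` is the function to use in practice.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """Variance ratio criterion: between-cluster over within-cluster dispersion."""
    n = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between, within = 0.0, 0.0
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        # Cluster size times squared distance from the centroid to the overall mean
        between += len(members) * np.sum((centroid - overall_mean) ** 2)
        # Squared distances of the members to their own centroid (the WCSS part)
        within += np.sum((members - centroid) ** 2)
    return (between / (k - 1)) / (within / (n - k))

X_demo = np.array([[1, 2], [1, 4], [1, 0],
                   [4, 2], [4, 4], [4, 0]], dtype=float)
labels_demo = np.array([0, 0, 0, 1, 1, 1])

chi = calinski_harabasz(X_demo, labels_demo)
print(chi)   # 3.375
```

For this toy data B = 13.5 and W = 16, giving (13.5 / 1) / (16 / 4) = 3.375.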

|Metric	|Range	|Goal  |	Best Value |
|-------|-------|------|---------------|
|WCSS	|0 → ∞	|Minimize  	|Smallest  |
|Silhouette Score	|-1 → 1	|Maximize	|Closest to 1 |
|Davies–Bouldin Index	|0 → ∞	|Minimize	|Smallest |
|Calinski–Harabasz	|0 → ∞	|Maximize	|Largest   |

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
import matplotlib.pyplot as plt
import numpy as np

# Step 1: Generate synthetic data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Step 2: Prepare lists for metrics
K_range = range(2, 11)
wcss = []
silhouette_scores = []
dbi_scores = []
chi_scores = []

# Step 3: Loop and compute all 4 metrics
for k in K_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', n_init=10, max_iter=300, random_state=42)
    labels = kmeans.fit_predict(X)

    wcss.append(kmeans.inertia_)
    silhouette_scores.append(silhouette_score(X, labels))
    dbi_scores.append(davies_bouldin_score(X, labels))
    chi_scores.append(calinski_harabasz_score(X, labels))

# Step 4: Plot all metrics
fig, axs = plt.subplots(4, 1, figsize=(8, 14), sharex=True)

axs[0].plot(K_range, wcss, 'bo--')
axs[0].set_title("WCSS (Lower is Better)")
axs[0].set_ylabel("WCSS")
axs[0].grid(True)

axs[1].plot(K_range, silhouette_scores, 'go--')
axs[1].set_title("Silhouette Score (Higher is Better)")
axs[1].set_ylabel("Silhouette")
axs[1].grid(True)

axs[2].plot(K_range, dbi_scores, 'ro--')
axs[2].set_title("Davies–Bouldin Index (Lower is Better)")
axs[2].set_ylabel("DBI")
axs[2].grid(True)

axs[3].plot(K_range, chi_scores, 'mo--')
axs[3].set_title("Calinski–Harabasz Index (Higher is Better)")
axs[3].set_ylabel("CHI")
axs[3].set_xlabel("Number of Clusters (K)")
axs[3].grid(True)

plt.tight_layout()
plt.show()

Interpretation Strategy

Use these together:
- 🔍 Look for elbow in WCSS
- ✅ Find peak in Silhouette & CHI
- 🔻 Look for minimum in DBI

If K = 4 gives:
- High Silhouette
- High CHI
- Low DBI
→ It’s a great candidate.

### **What Is the Gap Statistic?**
The Gap Statistic compares the inertia (within-cluster dispersion) of your data’s clustering result to the expected inertia under a null reference distribution, i.e., data with no cluster structure, typically generated uniformly at random over the same range.
- It tells you how much better your clustering is compared to clustering on random data.
- Idea: If the clustering on your real data is much more compact than clustering on random noise, that K is a good candidate.

Why Use the Gap Statistic?
- It corrects for the natural tendency of WCSS to decrease as K increases (a flaw in the Elbow Method).
- It’s model-agnostic and more robust for choosing K.
- It evaluates how much structure exists in the data relative to noise.

Formula
- For each number of clusters $k$:
$$ \text{Gap(k)} = E[\log (W^*_k)] - \log(W_k) $$
- Where:
    - $W_k$ : Within-cluster dispersion for your actual data
    - $W^*_k$ : Within-cluster dispersion for random (null) reference data
    - $E[\cdot]$ : Average over multiple simulations

### How the Gap Statistic Works
For each value of $K$ (number of clusters):
1. Run K-Means on your data and compute the inertia $W_k$.
2. Generate $B$ reference datasets (e.g., uniform random data over the same range as each feature).
3. Run K-Means on each reference dataset and average the log of their inertias to estimate $E[\log(W^*_k)]$.
4. Compute the Gap:
$$ \text{Gap(k)} = E[\log (W^*_k)] - \log(W_k) $$
5. Choose the smallest K such that:
$$ \text{Gap(k)} \geq \text{Gap(k +1)} - s_{k+1}$$
- where: 
    - $s_{k+1} = \text{sd}_{k+1}\sqrt{1 + 1/B}$, and $\text{sd}_{k+1}$ is the standard deviation of $\log(W^*_{k+1})$ across the $B$ reference datasets.

Interpretation
- A large Gap(k) means the data is much more tightly clustered at that K than structureless random data would be.
- A higher gap indicates better clustering relative to random, but the selection rule is not simply "take the largest gap". Choose the smallest K such that:
$$ \text{Gap(k)} \geq \text{Gap(k +1)} - s_{k+1}$$
- In other words, pick the first K after which adding another cluster no longer raises the gap by more than the simulation error (similar to elbow logic, but statistically grounded).

When to Use Gap Statistic
- When elbow or silhouette plots are ambiguous
- When you want a more principled, statistically driven method
- When cluster quality is crucial for business decisions (e.g., customer segmentation)
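
The procedure can also be sketched by hand before relying on a package. The code below is an illustrative, unoptimized version under stated assumptions: reference data is drawn uniformly from the bounding box of `X`, scikit-learn's `inertia_` stands in for $W_k$, and the helper names `gap_statistic` and `choose_k`, the toy data, and `n_refs=5` are all hypothetical choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def gap_statistic(X, k_values, n_refs=10, seed=0):
    """Return (gaps, s) for each K, using uniform reference data drawn
    from the bounding box of X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, s = [], []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        log_wk = np.log(km.inertia_)
        ref_logs = []
        for _ in range(n_refs):
            X_ref = rng.uniform(lo, hi, size=X.shape)   # structureless data
            km_ref = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_ref)
            ref_logs.append(np.log(km_ref.inertia_))
        ref_logs = np.array(ref_logs)
        gaps.append(ref_logs.mean() - log_wk)                # Gap(k)
        s.append(ref_logs.std() * np.sqrt(1 + 1 / n_refs))   # s_k
    return np.array(gaps), np.array(s)

def choose_k(k_values, gaps, s):
    """Smallest K with Gap(K) >= Gap(K+1) - s_{K+1}; otherwise the largest gap."""
    for idx in range(len(k_values) - 1):
        if gaps[idx] >= gaps[idx + 1] - s[idx + 1]:
            return k_values[idx]
    return k_values[int(np.argmax(gaps))]

X_demo, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)
k_values = list(range(1, 7))
gaps, s = gap_statistic(X_demo, k_values, n_refs=5)
chosen_k = choose_k(k_values, gaps, s)

print("Gap values:", np.round(gaps, 3))
print("Chosen K:", chosen_k)
```

Every candidate K needs `n_refs + 1` K-Means fits, which is where the computational cost of the method comes from.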

Python Implementation (Using the gap-stat Package, imported as gap_statistic)

In [None]:
!pip install gap-stat

In [None]:
import numpy as np
from sklearn.datasets import make_blobs
from gap_statistic import OptimalK
import matplotlib.pyplot as plt

# Generate synthetic data
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

# Initialize the Gap Statistic
optimalK = OptimalK(parallel_backend='joblib')  # backend for the reference fits; supported names depend on the gap-stat version

# Compute the optimal K
n_clusters = optimalK(X, cluster_array=np.arange(1, 11))
print(f"Optimal number of clusters: {n_clusters}")

# Plot the gap values
plt.plot(optimalK.gap_df.n_clusters, optimalK.gap_df.gap_value, marker='o')
plt.title('Gap Statistic')
plt.xlabel('Number of Clusters')
plt.ylabel('Gap Value')
plt.grid(True)
plt.show()


In [None]:
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from gap_statistic import OptimalK
import matplotlib.pyplot as plt

# Simulated data with clear clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

# Compute optimal K using Gap Statistic
optimalK = OptimalK(parallel_backend='rust')  # Use 'joblib' if rust fails
n_clusters = optimalK(X, cluster_array=np.arange(1, 11))

print(f"Optimal number of clusters by Gap Statistic: {n_clusters}")

# Plot gap values
plt.plot(optimalK.gap_df.n_clusters, optimalK.gap_df.gap_value, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Gap Statistic')
plt.title('Gap Statistic Method for Optimal K')
plt.grid(True)
plt.show()

|Feature |Description|
|------------|------------------------------------------------------|
|📐 Metric |Gap = Difference in log(WCSS) between real and random|
|🎯 Goal |Maximize Gap (find biggest drop from reference)|
|🧪 Evaluates |Whether real data has stronger clustering than noise|
|✅ Usefulness |More objective and statistically robust than elbow|
|⚠️ Limitation |Computationally expensive due to repeated KMeans|


Quick Comparison of K-Selection Methods

|Method	|Goal	|Best K =|
|-------|-------|--------|
|Elbow	|Sharp drop in inertia	|Elbow point|
|Silhouette	|Maximize cluster quality|	Max score|
|Gap Statistic|	Compare to random null|	Max gap with tolerance|

STEP 1: Create a Customer Demographics Dataset

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances
import random

# Simulated customer demographics: Age, Income, SpendingScore
np.random.seed(42)
n_samples = 300

age = np.random.normal(40, 12, n_samples)
income = np.random.normal(60000, 15000, n_samples)
score = np.random.normal(50, 20, n_samples)

df = pd.DataFrame({
    'Age': age,
    'Income': income,
    'SpendingScore': score
})

# Preprocess
scaler = StandardScaler()
X = scaler.fit_transform(df)


STEP 2: Define Helper to Compute WCSS

This computes the total within-cluster sum of squares (dispersion):

In [None]:
def compute_wcss(X, n_clusters):
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    kmeans.fit(X)
    return kmeans.inertia_

STEP 3: Generate Reference (Random Uniform) Datasets

We simulate B reference datasets and compute WCSS on those.

In [None]:
def generate_reference_data(X, n_samples):
    mins = np.min(X, axis=0)
    maxs = np.max(X, axis=0)
    return np.random.uniform(mins, maxs, size=(n_samples, X.shape[1]))

STEP 4: Manual Gap Statistic Implementation

In [None]:
def gap_statistic(X, k_range, B=10):
    n_samples = X.shape[0]
    gap_values = []
    sdk = []

    for k in k_range:
        # Wk = actual data
        wk = compute_wcss(X, k)

        # Generate B reference datasets
        wk_refs = []
        for _ in range(B):
            X_ref = generate_reference_data(X, n_samples)
            wk_refs.append(compute_wcss(X_ref, k))
        
        log_wk_refs = np.log(wk_refs)
        log_wk = np.log(wk)

        # Gap statistic
        gap_k = np.mean(log_wk_refs) - log_wk
        sk = np.std(log_wk_refs) * np.sqrt(1 + 1/B)

        gap_values.append(gap_k)
        sdk.append(sk)

    return gap_values, sdk

STEP 5: Run the Gap Statistic and Plot

In [None]:
k_range = range(1, 11)
gap_vals, sk_vals = gap_statistic(X, k_range, B=10)

plt.figure(figsize=(8, 5))
plt.plot(k_range, gap_vals, marker='o')
plt.xlabel("Number of Clusters K")
plt.ylabel("Gap Value")
plt.title("Gap Statistic for Optimal K (Customer Demographics)")
plt.grid(True)
plt.show()


STEP 6: Selecting the Best K

You can choose the optimal K using the rule:
- Choose the smallest K such that Gap(K) ≥ Gap(K+1) - s_{K+1}

In [None]:
def optimal_k_by_gap(gap_vals, sk_vals):
    for k in range(0, len(gap_vals)-1):
        if gap_vals[k] >= gap_vals[k+1] - sk_vals[k+1]:
            return k + 1  # Since k is zero-indexed
    return len(gap_vals)

optimal_k = optimal_k_by_gap(gap_vals, sk_vals)
print(f"Optimal number of clusters (by Gap Statistic): {optimal_k}")

### Interpretation
If the optimal K = 4 (say), it means:
- Clustering the customer demographics data into 4 groups gives a compact, meaningful grouping.
- Going beyond that (e.g., K=5,6) doesn’t significantly reduce WCSS compared to random structure — so extra clusters aren’t meaningful.

### **What is Principal Component Analysis (PCA)?**
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional form without losing much information.

🧩 Primary Features of PCA

|Feature	|Description  |
|-----------|-------------|
|✅ Dimensionality reduction 	|Reduces the number of variables (features) while retaining the most variance (information).|
|📊 Linear transformation 	|Projects data onto new orthogonal axes (principal components).|
|🔢 Principal components (PCs)	|New axes are linear combinations of the original features.|
|🧠 Ordered by variance	|PC1 captures the most variance, PC2 the second most, etc.|
|🔁 Unsupervised	|No labels are used — similar to clustering in that way.|
|🔍 Helps visualization	|Reduces data to 2D or 3D for visualizing patterns or clusters.|
|⚖️ Requires scaling	|Important to standardize features before applying PCA.|

🧠 How PCA Works (Conceptually)

Standardize the Data
- Scale variables (z-score) to have mean = 0 and std = 1.

Compute Covariance Matrix
- Measures how features vary with one another.

Compute Eigenvectors & Eigenvalues
- Eigenvectors define directions (principal components); eigenvalues define importance (variance explained).

Project Data
- Transform original data onto the selected principal components (usually top 2–3).

📌 Example Use Case

Suppose you have customer data with:
- Age, Income, Spending Score, Education Level, etc.

You want to reduce the dimensions to 2 or 3 while keeping the core structure intact, which is especially useful for clustering or visualization.

🔗 How PCA Relates to Cluster Analysis
- PCA and clustering complement each other in unsupervised learning.

|Relationship	|Explanation  |
|---------------|-------------|
|✅ Preprocessing for Clustering	|PCA reduces dimensionality and noise, helping clustering algorithms like KMeans perform better.|
|📉 Deals with multicollinearity	|Reduces redundancy between variables, making distance-based clustering more effective.|
|📊 Visualization	|PCA makes high-dimensional cluster structure easier to visualize in 2D or 3D plots.|
|🔎 Interpretability	|You can observe clusters forming along PC1 and PC2 axes, revealing structure in otherwise noisy or high-dimensional data.|

✅ PCA + Clustering: Typical Workflow
1. Start with high-dimensional dataset
2. Apply standard scaling (z-score)
3. Apply PCA (retain 2 or 3 components)
4. Visualize data in PC1-PC2 space
5. Apply clustering (e.g., KMeans or Hierarchical)
6. Analyze or validate cluster structure
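
The workflow above can be sketched with a scikit-learn pipeline. This is a minimal illustration on the Iris data; the choice of 2 components and 3 clusters is an assumption made for the example, not a recommendation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_iris().data                      # step 1: 150 samples, 4 features

# Steps 2-3: z-score scaling, then keep 2 principal components
reducer = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = reducer.fit_transform(X)           # step 4: coordinates in PC1-PC2 space

# Step 5: cluster in the reduced space (3 clusters is an assumption)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_2d)

print(X_2d.shape, len(set(labels)))
```

Step 6 (validation) would then compare metrics such as the silhouette score on `X_2d` and `labels`.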

Summary

|Concept |PCA |Clustering|
|-----------|-------|----------|
|Purpose |Dimensionality reduction |Group similar observations|
|Type |Unsupervised |Unsupervised  |
|Output |Principal components (new axes) |Cluster labels (groups)|
|Relationship |PCA simplifies data for better clustering & visualization | |


### How PCA Helps Improve Clustering
🧠 Why PCA Helps:
- Removes noise and redundancy (like correlated features)
- Reduces dimensionality, which:
    - Speeds up clustering
    - Makes it easier for K-Means to find compact groups
    - Improves cluster separation, especially for visualization

Example (Using Iris Dataset)

- We’ll first apply K-Means to the original (standardized) features
- Then to the PCA-reduced features
- And compare cluster validity metrics for the two results

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

# Load data
iris = load_iris()
X = iris.data

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Clustering on original features
kmeans_original = KMeans(n_clusters=3, n_init=10, random_state=42)  # explicit n_init, as in the earlier cells
labels_orig = kmeans_original.fit_predict(X_scaled)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Clustering on PCA features
kmeans_pca = KMeans(n_clusters=3, n_init=10, random_state=42)  # explicit n_init, as in the earlier cells
labels_pca = kmeans_pca.fit_predict(X_pca)

# Validation Scores
def cluster_metrics(X, labels):
    return {
        "Silhouette": silhouette_score(X, labels),
        "Calinski-Harabasz": calinski_harabasz_score(X, labels),
        "Davies-Bouldin": davies_bouldin_score(X, labels)
    }

metrics_original = cluster_metrics(X_scaled, labels_orig)
metrics_pca = cluster_metrics(X_pca, labels_pca)

print("Original Feature Space:", metrics_original)
print("PCA-Reduced Feature Space:", metrics_pca)


Interpretation:
- Higher Silhouette and Calinski-Harabasz → Better separation
- Lower Davies-Bouldin → More compact and distinct clusters

In many real-world cases, PCA can enhance the clustering structure especially when features are highly correlated or noisy.

### Interpreting PCA Components

📌 Understanding PCA Loadings
You can check how each original feature contributes to (loads onto) each principal component by inspecting `pca.components_`. The table below is an illustrative example; the exact values depend on preprocessing such as scaling.

|Feature	|PC1	|PC2  |
|----------|---------|------|
|sepal length	|0.36	|-0.06|
|sepal width	|-0.08	|0.93 |
|petal length	|0.86	|0.25 |
|petal width	|0.36	|0.25 |

🔍 Interpretation:
- PC1 weights petal length most heavily (0.86), with smaller contributions from petal width and sepal length (0.36 each) → mainly captures petal size
- PC2 is strongly influenced by sepal width

So, cluster separation along PC1 is likely tied to petal dimensions, which is meaningful in the Iris dataset!

✅ Summary
|Step	|Purpose|
|-------|--------|
|PCA	|Reduce dimensions, simplify data|
|K-Means on PCA|	Often finds better, cleaner clusters|
|Validity metrics comparison|	Quantifies improvement|
|Loadings|	Explain what each PC represents|

In [None]:
import pandas as pd

loadings = pd.DataFrame(
    pca.components_.T,
    columns=["PC1", "PC2"],
    index=iris.feature_names
)

print(loadings)

# Principal Component Analysis (PCA)
What It Means: 
- PCA reduces the number of variables in the data by finding combinations of variables that capture the most information (variance).

Outcome Interpretation: 
- Each "principal component" explains a percentage of the total variance, helping simplify the data without losing much information.

Performance Measures:
- Explained Variance Ratio: Shows how much information each principal component holds; higher is better.

Lay Explanation: 
- PCA is like summarizing a book by keeping only the most important points, making data easier to work with without losing key insights.

Use Case: 
- Dimensionality reduction while retaining the most critical information.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(X_scaled)  # use the standardized features, since PCA is scale-sensitive

## Principal Component Analysis (PCA)
In Principal Component Analysis (PCA), the goal is to reduce the number of dimensions while retaining as much variance (information) as possible. 

There are two primary ways to control the amount of variance retained:

✅ 1. Set the Number of Principal Components (n_components)
- You can directly specify how many principal components you want to keep, for example `n_components=2` (see the cell below).
- PCA will then retain the first two components, which capture the largest share of variance that any two components can.
- However, you won’t know exactly how much variance is retained unless you calculate or plot it.

In [None]:
pca = PCA(n_components=2)

✅ 2. Set a Variance Threshold (Proportion of Variance Retained)
- Instead of fixing the number of components, you can set a target proportion of explained variance to retain, such as 95% (`n_components=0.95`, see the cell below).
- PCA will automatically determine the minimum number of components required to explain at least 95% of the total variance in the dataset.
- This is useful for adaptive dimensionality reduction.

In [None]:
pca = PCA(n_components=0.95)

Summary

|Method	|How it Works	|When to Use  |
|-------|---------------|-------------|
|n_components=int	|Fix number of components manually	|When you want a specific number of dimensions|
|n_components=float	|Retain a target % of variance (e.g., 0.95)	|When you want to preserve a minimum amount of information|
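
To make the two options concrete, the sketch below fits both variants on the standardized Iris data and reports how much variance each retains. The dataset choice is an assumption for illustration; `n_components_` is the fitted attribute that holds the number of components scikit-learn actually kept.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

pca_fixed = PCA(n_components=2).fit(X_std)        # fixed number of components
pca_thresh = PCA(n_components=0.95).fit(X_std)    # keep at least 95% of the variance

retained_fixed = pca_fixed.explained_variance_ratio_.sum()
retained_thresh = pca_thresh.explained_variance_ratio_.sum()

print("Fixed k=2 retains:", round(retained_fixed, 3))
print("Threshold 0.95 chose k =", pca_thresh.n_components_,
      "and retains", round(retained_thresh, 3))
```

With the integer form you choose k and then inspect the retained variance; with the float form you choose the retained variance and then inspect k.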


Cumulative Explained Variance Plot (a cumulative variant of the scree plot)
- You can visualize how variance accumulates across components:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

pca = PCA().fit(X_scaled)
explained_var_ratio = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(explained_var_ratio)+1), explained_var_ratio, marker='o')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance')
plt.axhline(y=0.95, color='r', linestyle='--')
plt.grid(True)
plt.show()


### Relationship between the number of features and the number of parameters in PCA
Understanding how the number of features relates to the number of parameters in a dimensionality reduction method such as Principal Component Analysis (PCA) is key to both interpreting and designing efficient models.

✅ Core Concept
- In dimension reduction, we project the original high-dimensional data (with many features) into a lower-dimensional space using a set of transformed axes (components). 
- The parameters we estimate are tied to how we create this new space.

🔁 Relationship Explained
- Let’s break it down using PCA as an example:

🔹 Suppose:
- You have a dataset with n_samples and n_features.
- You want to reduce the data to k principal components, where k < n_features.

📌 Parameters Estimated in PCA:
- Component Loadings (also called eigenvectors or weights):
    - PCA learns a weight (loading) for each feature in each component.
    - Shape: (n_features × k)
- So, PCA estimates n_features × k parameters for the transformation matrix.

Explained Variance:
- PCA also calculates the eigenvalues (amount of variance each component captures).
- You estimate k variance values, but this is small compared to the loadings.

🧠 Interpretation:
- More Features → More Parameters to Estimate:
    - If you start with 100 features and reduce to 5 components, PCA must estimate 100 × 5 = 500 parameters (loadings).
- More Components → Closer to Original Representation:
    - Retaining more components keeps more information but increases parameter count.
- Fewer Components → Simpler Model:
    - Reduces parameters and computation, but you may lose some variance.

🧮 Quick Formula:
- Number of parameters estimated ≈ n_features × k, where k is the number of retained components.

📊 Example:

If your original dataset has:
- 50 features
- 2000 samples
- You reduce to 10 components

Then:
- PCA estimates 50 × 10 = 500 parameters for the loadings matrix.
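
The count can be checked directly by building the loadings matrix. The sketch below uses random placeholder data with the same dimensions as the example (2000 samples, 50 features, 10 components); the data itself is an assumption and carries no meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, k = 2000, 50, 10

X = rng.normal(size=(n_samples, n_features))     # placeholder data
Z = (X - X.mean(axis=0)) / X.std(axis=0)         # standardize each feature

# Eigendecomposition of the covariance matrix (eigh returns ascending order)
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
loadings = eigvecs[:, order[:k]]                 # top-k eigenvectors as columns

print(loadings.shape)   # (n_features, k)
print(loadings.size)    # n_features * k
```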

📌 Summary Table:
|Item	|Value  |
|-------|-------|
|Original Features	|n_features   |
|Reduced Components	| k  |
|Parameters Estimated|	n_features × k  |
|Goal	|Capture max variance with fewer dimensions  |

##### Walk through a concrete example using simulated data to:
1. Generate a dataset with many features
2. Apply PCA with different k values
3. Calculate and interpret:
    - Number of parameters estimated
    - Variance retained

Step-by-Step PCA Example with Parameter Count

In [None]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Step 1: Simulate data
np.random.seed(42)
n_samples = 1000
n_features = 20

X = np.random.rand(n_samples, n_features)
X = pd.DataFrame(X, columns=[f"feature_{i+1}" for i in range(n_features)])

# Step 2: Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Fit PCA (retain all components to analyze)
pca = PCA()
pca.fit(X_scaled)

# Step 4: Calculate explained variance and loadings
explained_variance = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(explained_variance)

# Step 5: Plot cumulative variance
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(cumulative_variance)+1), cumulative_variance, marker='o')
plt.axhline(y=0.95, color='r', linestyle='--', label='95% variance')
plt.title('Cumulative Explained Variance by PCA Components')
plt.xlabel('Number of Components (k)')
plt.ylabel('Cumulative Variance Explained')
plt.grid(True)
plt.legend()
plt.show()


Now Let’s Interpret:

✅ How many components needed to retain 95% variance?

In [None]:
k_95 = np.argmax(cumulative_variance >= 0.95) + 1
print(f"Minimum components needed to retain 95% variance: {k_95}")

✅ Number of Parameters Estimated:

In [None]:
n_parameters = n_features * k_95
print(f"Parameters estimated for PCA transformation (loadings): {n_parameters}")

Interpretation:
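- Suppose, for illustration, the output is as follows (actual numbers depend on the data; the independent uniform features simulated above share little variance, so expect a value much closer to 20 there):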
    - Minimum components needed to retain 95% variance: 14
    - Parameters estimated for PCA transformation (loadings): 280

This means:
- You can reduce from 20 to 14 components and still retain 95% of the information.
- PCA estimates 280 parameters (20 features × 14 components) to form the transformation matrix (the loadings).

Projecting the original data onto the top k principal components gives you a reduced-dimension representation of your data. This is one of the main goals of PCA.
- The code cell at the end of this subsection projects the data onto the top k = 14 components (the illustrative value from the previous example).


What this means:
- X_pca_df now holds the data projected into a new space with 14 orthogonal axes (principal components).
- Each principal component is a linear combination of the original 20 features, designed to maximize variance.

This lower-dimensional dataset can now be used for:
- Clustering (like K-Means)
- Visualization (with 2 or 3 components)
- Feeding into supervised learning models

✅ Why this is powerful:
- Reduces noise and redundancy in your features
- Can shorten training time and simplify downstream models
- Keeps the most informative variation in your data

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import pandas as pd
import numpy as np

# Step 1: Standardize the data again (if not already done)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Fit PCA with k=14 components
pca_14 = PCA(n_components=14)
X_pca_14 = pca_14.fit_transform(X_scaled)

# Step 3: Convert to DataFrame for easier viewing
X_pca_df = pd.DataFrame(X_pca_14, columns=[f'PC{i+1}' for i in range(14)])

# Show the transformed dataset
X_pca_df.head()


### Two major steps in any dimension reduction method

Objective:
Reduce high-dimensional data into fewer, more meaningful dimensions without losing significant information.

We do this in two interconnected steps, where each step feeds into the next.

##### **1. Feature Transformation / Projection into a New Space**
This step transforms the original high-dimensional features into a new set of dimensions (components), which are often linear combinations of the original features.

What Happens:
- Transform the original data (features) into a new coordinate system.
- These new axes (called principal components in PCA) are:
    - Orthogonal (uncorrelated)
    - Ordered by how much variance they capture from the original data

💡 Why It Matters:
- Helps identify the directions in which the data varies the most
- Converts correlated features into uncorrelated axes
- Removes redundancy

📊 Mathematically:
- Compute the covariance matrix of the standardized data.
- Perform eigendecomposition or SVD to extract:
    - Eigenvectors (principal axes)
    - Eigenvalues (amount of variance captured by each axis)
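
The two routes mentioned above (eigendecomposition of the covariance matrix, or SVD of the data) give the same answer. The sketch below checks this numerically on placeholder data; the mixing matrix is an arbitrary assumption used only to create correlated features.

```python
import numpy as np

rng = np.random.default_rng(1)
# Placeholder data: 200 samples, 3 correlated features
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.5]])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
n = Z.shape[0]

# Route 1: eigendecomposition of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(Z.T @ Z / (n - 1))
eigvals = eigvals[::-1]                  # descending order
eigvecs = eigvecs[:, ::-1]

# Route 2: SVD of the standardized data itself
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
svd_variances = S**2 / (n - 1)

# Same variances; same axes up to an arbitrary sign per axis
same_variances = np.allclose(eigvals, svd_variances)
same_axes = np.allclose(np.abs(eigvecs.T @ Vt.T), np.eye(3), atol=1e-6)
print(same_variances, same_axes)
```

The eigenvalues equal the squared singular values divided by n - 1, and the rows of `Vt` match the eigenvectors up to sign.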

Example: In PCA,
- The new axes are the principal components, aligned with the directions of maximum variance.
    - They are orthogonal to each other in the feature space.
    - They are ordered by how much variance they capture from the data.
    - The transformation is linear (multiplying the standardized data by the eigenvectors).
- The transformation reorients the data into a more compact and meaningful representation.
- It involves computing eigenvectors (directions) and eigenvalues (variance captured) of the data’s covariance matrix.

🧠 Goal: Find a new coordinate system that captures the most important patterns in fewer dimensions.

✅ Goal: Convert the original features into a new set of variables (called components, axes, or dimensions) that are combinations of the original ones.

##### **2. Dimensionality Truncation / Selection of Top-k Components**
After transformation, you select the top k components that explain most of the variance in the data, and discard the rest.

What Happens:
- Select the top k principal components that together capture most of the variance (e.g., 95%).
- Project the original data into this reduced space.

💡 Why It Matters:
- Reduces the number of dimensions while retaining the structure of the original dataset.
- Discards noisy or less informative components.

📊 Common Methods for Choosing k:
- Explained Variance Threshold (e.g., 95%)
- Scree Plot / Elbow Method
- Eigenvalues > 1 (Kaiser’s rule)
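
Two of these rules are easy to compute once the eigenvalues are known. The sketch below uses made-up eigenvalues for a hypothetical 6-feature correlation matrix; the values are assumptions chosen only to show that the rules can disagree.

```python
import numpy as np

# Hypothetical eigenvalues of a correlation matrix with 6 features
# (made-up values that sum to 6, sorted in descending order)
eigenvalues = np.array([2.9, 1.6, 0.8, 0.45, 0.15, 0.1])

ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratio)

# Rule 1: smallest k whose cumulative explained variance reaches 95%
k_threshold = int(np.argmax(cumulative >= 0.95)) + 1

# Rule 2 (Kaiser): keep components whose eigenvalue exceeds 1
k_kaiser = int(np.sum(eigenvalues > 1))

print(np.round(cumulative, 3), k_threshold, k_kaiser)
```

Here the 95% threshold keeps 4 components while Kaiser’s rule keeps only 2, so the choice of rule matters.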

Example: In PCA:
- This is where you reduce the number of dimensions.
    - Evaluate each component’s explained variance (eigenvalues).
    - Select enough components to retain a chosen threshold of variance (e.g., 95%).
- The number of components k is chosen based on criteria like:
    - Cumulative explained variance (e.g., retain 95%)
    - Scree plot (elbow method)
    - Eigenvalues > 1 rule
- You then project the original data onto just these k components.

🧠 Goal: Keep the most informative structure of the data while reducing noise and complexity.

✅ Goal: Choose the top k components (from all transformed ones) that capture most of the information (variance) while reducing dimensionality.

🔁 How They Work Together
|Phase	|Description|
|--------|-----------|
|Step 1: Transformation	|Converts the original dataset to a new space with uncorrelated components|
|Step 2: Truncation	|Selects the most informative components to reduce dimensionality|

🧠 Analogy: Think of Step 1 as reorienting the camera to get the best angles of a scene. Step 2 is about cropping the picture to only keep the most important view.

📌 Summary Table
|Step	|Description |  Purpose    |
|-------|------------|-------------|
|1. Transform / Feature transformation (e.g., PCA axes)	|Re-express original data into new components (rotated axes). | Find new informative directions   |
|2. Select / Dimension selection (choose top k) |Keep only the top components that explain most of the variance.  |  Reduce dimensions, retain meaningful info |

### Explore the mathematical implementation of PCA using eigenvalues and eigenvectors
How PCA reduces dimensions while preserving most of the variance.

🔢 PCA: Mathematical Steps Using Eigenvalues & Eigenvectors
- Let’s assume we have a dataset X with n samples and d features (dimensions).

✅ Step 1: Standardize the Data

PCA is sensitive to scale, so we first standardize:
$$ Z = \frac{X - \mu}{\sigma} $$
- Each feature gets zero mean and unit variance.

✅ Step 2: Compute the Covariance Matrix

The covariance matrix shows how variables co-vary:
$$ \text{Cov(Z)} = \frac{1}{n - 1} Z^T Z$$
- It will be a d x d symmetric matrix showing how each pair of features relates.

✅ Step 3: Calculate Eigenvalues and Eigenvectors

Solve:
$$ \text{Cov(Z)} \cdot v = \lambda \cdot v $$
- Where:
    - v = eigenvector (principal component direction)
    - λ = eigenvalue (variance explained by that component)
    - The eigenvectors are the new axes (principal components).
    - The eigenvalues indicate the importance (variance) of each component.

✅ Step 4: Sort and Select Top k Eigenvectors
- Sort eigenvectors by their eigenvalues in descending order.
- Select top k eigenvectors $V_k \in \mathbb{R}^{d \times k}$

✅ Step 5: Project the Data
- Project the standardized data onto the selected components:
$$ Z_{projected} = Z \cdot V_k $$

This gives the data in the new reduced-dimensional space.
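
The five steps can be followed line by line in NumPy. This is a minimal from-scratch sketch on placeholder data (300 samples, 3 features, k = 2), not a replacement for `sklearn.decomposition.PCA`; the data and the value of k are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
# Placeholder data: two strongly correlated features plus one independent feature
base = rng.normal(size=(300, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(300, 1)),
               2 * base + 0.1 * rng.normal(size=(300, 1)),
               rng.normal(size=(300, 1))])
n, d = X.shape
k = 2

# Step 1: standardize (ddof=1 matches the 1/(n-1) covariance below)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Step 2: covariance matrix (d x d)
cov = Z.T @ Z / (n - 1)

# Step 3: eigenvalues and eigenvectors (eigh is meant for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: sort in descending order and keep the top k eigenvectors
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
V_k = eigvecs[:, :k]                      # shape (d, k)

# Step 5: project the standardized data
Z_projected = Z @ V_k                     # shape (n, k)

print(Z_projected.shape)
print(np.round(eigvals / eigvals.sum(), 3))   # explained variance ratios
```

The variance of each projected column equals the corresponding eigenvalue, which is why eigenvalues are read as "variance explained".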

Example with Code: 2D Dataset Reduced to 1D

Interpretation (of the plot produced by the cell below):
- The red arrows are the principal axes (eigenvectors).
- The length reflects the amount of variance explained (eigenvalue).
- If we projected the data onto PC1 (longer arrow), we reduce the data from 2D → 1D, keeping the maximum variance.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Generate synthetic 2D data
np.random.seed(0)
X = np.random.multivariate_normal(mean=[0, 0], cov=[[3, 2.5], [2.5, 3]], size=200)

# Step 1: Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Step 3: Visualize
plt.figure(figsize=(8,6))
plt.scatter(X_scaled[:,0], X_scaled[:,1], alpha=0.5, label='Original Data')
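for i, (length, vector) in enumerate(zip(pca.explained_variance_, pca.components_)):
    v = vector * 3 * np.sqrt(length)  # scale each arrow by the component's standard deviation
    # Label only the first arrow so the legend shows a single entry
    plt.arrow(0, 0, v[0], v[1], width=0.02, color='red',
              label='Principal Component' if i == 0 else '_nolegend_')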
plt.title('PCA: Original Data and Principal Components')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.axis('equal')
plt.legend()
plt.show()


In [None]:
print("Eigenvalues (explained variance):", pca.explained_variance_)
print("Explained variance ratio:", pca.explained_variance_ratio_)


### Interpreting principal components (PCs)
Crucial for understanding what each component represents in terms of original features.

**How to Interpret Principal Components**
After applying PCA, each principal component is a linear combination of the original features:
$$ PC_j = w_{1j} x_1 + w_{2j} x_2 + ... + w_{dj} x_d $$
- Where:
    - $w_{ij}$ is the loading (weight) of feature $x_i$ in principal component $j$
    - The larger the absolute value of $w_{ij}$ , the more that feature contributes to the component.

✅ Step-by-Step Interpretation
Look at the Loadings (aka Component Weights)

Use pca.components_ from sklearn:

In [None]:
import pandas as pd

# Assume you have a PCA model and feature names
loadings = pd.DataFrame(
    pca.components_.T,  # transpose to align features in rows
    columns=[f'PC{i+1}' for i in range(pca.n_components_)],
    index=feature_names  # list of original feature names
)

print(loadings)


Interpret Each PC
- Identify the top contributing features (those with highest absolute loadings).
- See if those features have thematic similarity (e.g., income-related, age-related).

Example Interpretation
- Let’s say for customer demographic features:

|          | PC1    | PC2    | PC3|
|----------|--------|--------|----|
|Age      |0.52   |-0.02   |0.13 |
|Income   | 0.49  |  0.18  | 0.23|
|Spending | 0.51  | -0.45  |-0.09|
|Visits   |-0.11  |  0.85  | 0.30|

🔹 Interpretation:

PC1 (first principal component):
- High loadings on Age, Income, Spending
- Interpretation: a general economic status dimension

PC2:
- High positive weight on Visits, negative on Spending
- Interpretation: maybe frequency vs amount of spending

PC3:
- Small or mixed values — might not be as easy to interpret or might capture minor variations

Heatmap of Loadings

In [None]:
import seaborn as sns
plt.figure(figsize=(8,5))
sns.heatmap(loadings, annot=True, cmap='coolwarm')
plt.title("Principal Component Loadings")
plt.show()


🧠 Key Tips:
- Signs of loadings matter for direction but absolute values matter for strength of contribution.
- Group features thematically when interpreting PCs.
- If PCs are hard to interpret, you might try rotated PCA (e.g., varimax) for better interpretability.

###  Characteristics of the First Principal Component
The first principal component (PC1) plays a central role in Principal Component Analysis (PCA). Here's a detailed explanation of its characteristics, how it’s derived, and why it’s important:

🔷 Characteristics of the First Principal Component (PC1)

✅ 1. Direction of Maximum Variance
- PC1 captures the largest possible variance in the dataset.
- It identifies the single best direction in feature space along which data points are most spread out.

$$ w_1 = \arg\max_{\|w\| = 1} \mathrm{Var}(Xw) $$
- Where:
    - $w$ is the direction (vector of weights), $X$ is the standardized data, and the PC1 scores are $Xw_1$.
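
The maximization can be checked numerically: no unit direction should give a larger variance than the top eigenvector of the covariance matrix. The sketch below compares it against 1000 random directions on placeholder 2D data; the covariance values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3, 2.5], [2.5, 3]], size=500)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Top eigenvector of the covariance matrix = candidate PC1 direction
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
w1 = eigvecs[:, -1]                     # eigh sorts eigenvalues in ascending order
var_pc1 = np.var(Z @ w1, ddof=1)

# Variance along 1000 random unit directions, for comparison
angles = rng.uniform(0, np.pi, size=1000)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
var_random = np.var(Z @ directions.T, axis=0, ddof=1)

print(round(var_pc1, 3), round(var_random.max(), 3))
```

The best random direction can come close to, but never exceed, the variance along `w1`, which equals the largest eigenvalue.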

✅ 2. Linear Combination of Original Features
- PC1 is a weighted sum of the original features:
$$ PC_1 = w_{1} x_1 + w_{2} x_2 + ... + w_{d} x_d $$
- The weights $w_i$ (called loadings) tell you how much each feature contributes to PC1.
- Features with higher absolute weights contribute more.

✅ 3. Orthogonality
- PC1 is orthogonal (statistically uncorrelated) to all subsequent principal components (PC2, PC3...).
- This ensures no redundancy between components.
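
Both properties (orthogonal directions, uncorrelated scores) can be verified in a few lines. The sketch below uses placeholder data with 4 correlated features; the random mixing matrix `A` is an assumption used only to induce correlation.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(4, 4))
X = rng.normal(size=(400, 4)) @ A       # placeholder data with correlated features
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
V = eigvecs[:, ::-1]                    # columns: PC1, PC2, ... directions
scores = Z @ V                          # data expressed in PC coordinates

# Directions are orthonormal: V^T V = I
directions_orthonormal = np.allclose(V.T @ V, np.eye(4))

# Scores are uncorrelated: their covariance matrix is diagonal
score_cov = np.cov(scores, rowvar=False)
off_diagonal = score_cov - np.diag(np.diag(score_cov))
scores_uncorrelated = np.allclose(off_diagonal, 0, atol=1e-10)

print(directions_orthonormal, scores_uncorrelated)
```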

✅ 4. Eigenvalue Correspondence
- PC1 corresponds to the largest eigenvalue of the covariance matrix.
- The associated eigenvector is the direction (loading vector) of PC1.

✅ 5. Data Projection and Interpretation
- Data projected onto PC1 (i.e., dot product of data with PC1) gives you a 1D representation capturing maximum variance.
- Often used for visualization and pattern discovery.

✅ 6. Explained Variance
- The explained variance ratio for PC1 shows how much of the total variance it captures:
$$ \text{Explained Variance Ratio of PC1} = \frac{\lambda_{1}}{\sum^d_{i = 1} \lambda_i} $$
- Where:
    - $\lambda_1$ is the eigenvalue corresponding to PC1, and $d$ is the number of features.

A high ratio (e.g., 70%+) indicates that PC1 alone gives a strong summary of the dataset.

🧠 Intuition Example:
- Suppose you have data on age, income, and spending score.
- PC1 might load highly on income and spending, and low on age.
- This could mean PC1 represents an "economic engagement" dimension.
- Individuals with high PC1 scores are higher spenders and earners.

##### Computing and interpreting the first principal component (PC1) using a realistic customer demographic dataset
We use features like:
- Age
- Income
- Spending Score
- Visit Frequency

These are common in customer segmentation.

✅ Step-by-Step: Compute and Interpret PC1
🔹 Step 1: Import Libraries and Simulate Example Data

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns

# Simulated customer demographic data
data = pd.DataFrame({
    'Age': [25, 45, 31, 35, 52, 23, 40, 60, 48, 33],
    'Income': [40000, 80000, 52000, 58000, 100000, 39000, 70000, 120000, 85000, 61000],
    'SpendingScore': [60, 30, 50, 40, 20, 65, 35, 15, 25, 45],
    'VisitFrequency': [4, 2, 3, 3, 1, 5, 2, 1, 1, 3]
})

🔹 Step 2: Standardize the Data

PCA is sensitive to scale, so we use z-score normalization.

In [None]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)


🔹 Step 3: Apply PCA

In [None]:
pca = PCA()
pca_result = pca.fit_transform(scaled_data)

# Create a DataFrame of PCA results
pca_df = pd.DataFrame(pca_result, columns=[f'PC{i+1}' for i in range(len(data.columns))])


🔹 Step 4: Interpret PC1 (Loadings)

In [None]:
# Loadings (components)
loadings = pd.DataFrame(pca.components_.T,
                        index=data.columns,
                        columns=[f'PC{i+1}' for i in range(len(data.columns))])

print("Loadings for PC1:")
print(loadings['PC1'].sort_values(ascending=False))


This tells us how much each original feature contributes to PC1.

✅ Example output (interpretation):

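🔍 Interpretation:
- PC1 contrasts Income & Age with Spending Score & Visit Frequency: the two groups load with opposite signs. The overall sign of a component is arbitrary, so which group appears as positive can vary.
- If Income & Age load positively, a high PC1 score = older, higher-income customers who spend and visit less.
- A low PC1 score then = younger customers with lower income who spend more and visit often.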

🔹 Step 5: Explained Variance
- This gives the proportion of total variance captured by PC1.

In [None]:
print("Explained Variance Ratio for PC1:", pca.explained_variance_ratio_[0])

🔹 Step 6: Optional – Visualize Customers in PC Space

In [None]:
plt.figure(figsize=(8, 5))
sns.scatterplot(x=pca_df['PC1'], y=pca_df['PC2'])
plt.title("Customer Distribution in PC1 vs PC2 Space")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.axhline(0, color='gray', linestyle='--')
plt.axvline(0, color='gray', linestyle='--')
plt.grid(True)
plt.show()


###  What Does It Mean to "Project a Point onto a Line"?

➤ In simple terms:
- When you project a point onto a line, you are dropping a perpendicular from the point to the line and identifying the location on the line where this perpendicular lands.
- That new point on the line is the projection.

🧠 Why Do We Do This?
- Projection answers the question:
    - "If I could represent this point using just this line, where would it fall?"
- In PCA, the line might be a principal component (like PC1), and the original point is a data point in high-dimensional space.

🔢 Mathematically Speaking

Let’s say:
- You have a vector x (your point) and a line through the origin in the direction of a unit vector u.

➤ The projection of x onto u is:
$$ \text{proj}_u(x) = (x \cdot u)\, u $$
- Where:
    - $x \cdot u$ is the dot product of x and u,
    - This gives the length of the projection in the direction of u,
    - Then we multiply by u to get the vector location on the line.

📍Geometric Meaning
- Imagine a flashlight shining perpendicular to the line u.
- The shadow of point x on that line is the projection.

In [None]:
import numpy as np

x = np.array([3, 4])      # our point
u = np.array([1, 0])      # x-axis unit direction

u = u / np.linalg.norm(u)   # ensure unit length
proj = np.dot(x, u) * u

print("Projection:", proj)


➡️ This says: If we only cared about the x-axis, then the point [3, 4] becomes [3, 0] when projected.

🔷 In PCA: Why is Projection Important?

PCA finds the directions (principal components) that explain variance. Then:
- Each data point is projected onto these directions,
- Those projections become the new features (PC1, PC2, etc.),
- This reduces dimensions and retains meaningful variation.

So, when you reduce from 10 features to 2, you're saying:
- "I want the 2D projections of all points on the best two lines capturing the most spread." 
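
As a rough check of this idea, the sketch below projects centered data onto the top two directions by hand and compares the result with `PCA.transform`. The random data and the variable names are made up for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))            # 50 points, 10 features (synthetic)

pca = PCA(n_components=2).fit(X)

# Manual projection: center the data, then take dot products with each PC direction
X_centered = X - pca.mean_
scores_manual = X_centered @ pca.components_.T   # shape (50, 2)

# scikit-learn's own projection
scores_sklearn = pca.transform(X)

print(scores_manual.shape)
print(np.allclose(scores_manual, scores_sklearn))
```

Each row of `scores_manual` holds the coordinates of one point along PC1 and PC2, which is what "projecting onto the principal components" means in practice.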

### Compute loadings for principal components in Principal Component Analysis (PCA)
Computing the loadings means solving an optimization problem under certain mathematical constraints. These constraints ensure that the solution is unique (up to sign), meaningful, and preserves the geometry of the data.

##### **🔷 What Are Loadings?**
- Loadings are the coefficients or weights assigned to each original variable when forming a principal component.
- Each principal component is a linear combination of the original variables.
$$ PC_1 = w_{1} x_1 + w_{2} x_2 + \dots + w_{p} x_p $$
- Where 
    - $w = [w_1, w_2, \dots, w_p]$ is the loading vector for PC1.

##### **🔧 The Optimization Problem in PCA**
We want to maximize the variance of the projected data on the principal component, i.e.:
$$ \text{maximize } \operatorname{Var}(Xw) = w^T S w $$
- Where:
    - X is the standardized data matrix
    - S is the covariance matrix of X,
    - w is the vector of loadings or weights.

##### **✅ The Key Constraint: Unit Length**
$$ w^T w = 1 $$

The loading vector must be a unit vector (i.e., it has length 1).

❓ Why this constraint?
- It prevents the trivial solution where you just scale the weights up indefinitely to increase variance.
- It ensures that the principal component is a direction, not a magnitude.
- Without it, the optimization would be unbounded (you could keep increasing variance by increasing w).

🔄 Orthogonality Constraints (for subsequent components)
- After computing the first principal component:
    - The second component must be orthogonal (perpendicular) to the first:
$$ w_1^T w_2 = 0 $$
- And this continues for PC3, PC4, etc.
- So for each new component $w_k$, the constraint is:
$$ w_k^T w_j = 0 \quad \text{for all } j < k $$

This ensures that each component captures new, uncorrelated information.
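
The two constraints can be checked numerically. The sketch below is only an illustration on synthetic data (the data and names are assumptions, not part of the original example): it takes the eigenvectors of a covariance matrix and confirms that each has unit length and that they are mutually orthogonal.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic correlated data: 200 observations, 4 features
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
S = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices; eigenvectors are returned as columns
eigenvalues, W = np.linalg.eigh(S)

unit_lengths = np.linalg.norm(W, axis=0)   # w^T w = 1  -> every norm should be 1
gram = W.T @ W                             # w_k^T w_j = 0 for k != j -> identity matrix

print(np.round(unit_lengths, 6))
print(np.allclose(gram, np.eye(4)))
```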

🔑 Summary of Constraints
|Constraint Type	|Mathematical Expression	|Purpose  |
|-------------------|---------------------------|---------|
|Unit length  | $w^T w = 1$ | Normalize the direction vector|
|Orthogonality	| $w_k^T w_j = 0 \text{ for all } j < k$| 	Ensure components are uncorrelated|
|Maximize variance	| $\text{maximize } w^T S w$ |Capture most information in fewer dimensions|

💡 Insight
These constraints lead to a classic eigenvalue problem. The loading vectors are the solutions of:
$$ Sw = \lambda w$$
- Where:   
    - $\lambda$ : the eigenvalue (variance explained),
    - w : the eigenvector (loading vector).
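
A small numerical sketch can make the link concrete. Assuming synthetic data (the mixing matrix below is arbitrary), it checks that the top eigenvector satisfies $Sw = \lambda w$, that the variance of the projected data equals $\lambda$, and that no randomly drawn unit vector achieves a larger projected variance.

```python
import numpy as np

rng = np.random.default_rng(2)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])       # arbitrary matrix to create correlated features
X = rng.normal(size=(300, 3)) @ mixing
Xc = X - X.mean(axis=0)
S = np.cov(Xc, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(S)   # eigenvalues in ascending order
w1 = eigenvectors[:, -1]                        # eigenvector with the largest eigenvalue
lam1 = eigenvalues[-1]

# 1) The eigenvalue equation S w = lambda w
eig_ok = np.allclose(S @ w1, lam1 * w1)

# 2) Variance of the data projected onto w1 equals lambda
proj_var = np.var(Xc @ w1, ddof=1)

# 3) Random unit vectors never beat the top eigenvector
random_dirs = rng.normal(size=(1000, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = max(d @ S @ d for d in random_dirs)

print(eig_ok, np.isclose(proj_var, lam1), best_random <= lam1)
```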

##### Step-by-step example in Python to compute the loadings (eigenvectors) for principal components using eigenvalue decomposition.

In [None]:
# Step 1: Import Libraries and Create the Dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Create a small synthetic dataset
data = {
    'Age': [25, 30, 45, 35, 23],
    'Income': [50000, 60000, 80000, 72000, 45000],
    'SpendingScore': [60, 70, 45, 55, 65]
}

df = pd.DataFrame(data)
print(df)

# Step 2: Standardize the Data (important for PCA!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

print("Standardized Data:\n", X_scaled)

# Step 3: Compute the Covariance Matrix
# PCA is based on the covariance matrix of the data.

# Compute the covariance matrix
cov_matrix = np.cov(X_scaled.T)  # Transpose to get features as rows
print("Covariance Matrix:\n", cov_matrix)

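# Step 4: Compute Eigenvalues and Eigenvectors
# This gives us the variance explained and the loadings.

# eigh is intended for symmetric matrices such as a covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort in descending order so that the columns correspond to PC1, PC2, PC3
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print("Eigenvalues:\n", eigenvalues)
print("Eigenvectors (loadings):\n", eigenvectors)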

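# Illustrative output (stylized values for explanation; running the code on this small dataset gives different numbers, and signs may be flipped)

Eigenvectors (columns = PC1, PC2, PC3):
[[ 0.58, -0.80, 0.12],
 [ 0.58,  0.52, 0.63],
 [ 0.58,  0.27, -0.77]]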


Interpreting the Output
- Eigenvalues represent the amount of variance explained by each component.
- Eigenvectors are the loadings — the weights for original features that define each principal component.

Each column in eigenvectors is a principal component.

The first principal component (PC1) is:
$$ PC1 = 0.58 \cdot \text{Age} + 0.58 \cdot \text{Income} + 0.58 \cdot \text{SpendingScore} $$

→ This suggests all features contribute equally to PC1.

The second component (PC2) contrasts Age (negatively) with Income and Spending.

In [None]:
# Step 5: Project the Data onto Principal Components
# Now that we have eigenvectors, we project the data to get PCA-transformed values.

# Project the original standardized data onto the principal components
X_pca = X_scaled @ eigenvectors

# Put into DataFrame for inspection
pca_df = pd.DataFrame(X_pca, columns=["PC1", "PC2", "PC3"])
print(pca_df)


### Principal Component Analysis (PCA) can be interpreted in two complementary ways

##### **Primary Interpretation of PCA (Variance Maximization)**
This is the most common interpretation of PCA:
- PCA finds new axes (principal components) such that the first captures the most variance in the data, the second captures the most remaining variance orthogonal to the first, and so on.
- Each principal component is a linear combination of the original features.
- The first principal component (PC1) is the direction in feature space along which the data varies the most.
- The components are orthogonal (uncorrelated) to each other.
- You reduce dimensions by keeping the top k components that explain most variance.

##### **Alternative Interpretation: Projection onto a Low-Dimensional Subspace**
This perspective comes from linear algebra and reconstruction error:
- PCA finds a lower-dimensional subspace that best approximates the original data by minimizing the reconstruction error (i.e., minimizing the squared distance between original data and its projection).

🔍 What does this mean?
- Suppose you project high-dimensional data (say 10D) onto a lower-dimensional space (say 2D).
- PCA finds the 2D plane that best approximates the original data — where the orthogonal (perpendicular) projection of each data point onto this plane results in minimal loss of information.
- This is equivalent to compressing the data and reconstructing it back with the least squared error.

In mathematical terms:
- PCA minimizes:
$$ \sum_{i=1}^{n} \| x_i - \hat{x}_i \|^2 $$
- Where:
    - $x_i$  is the original point, and  
    - $\hat{x}_i$ is its projection onto the low-dimensional subspace.

Comparing the Two Interpretations

|Perspective	|Description	|Optimization Goal|
|---------------|---------------|-----------------|
|Variance Maximization	|Finds directions with most variance	|Maximize variance captured  |
|Projection View (Reconstruction)	|Finds best low-rank approximation of the data	|Minimize reconstruction error  |

Both interpretations are mathematically equivalent — they lead to the same principal components via eigen decomposition or SVD — but provide different intuitions.
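
The equivalence can be illustrated numerically. In the sketch below (synthetic data, hypothetical variable names), the squared reconstruction error after keeping `k` components equals the variance in the discarded components, so captured variance plus reconstruction error always adds up to the total. Maximizing one is therefore the same as minimizing the other.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # synthetic correlated data
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigenvalues)[::-1]                     # sort descending
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
W = eigenvectors[:, :k]
X_hat = Xc @ W @ W.T            # project onto k components, then map back

recon_error = np.sum((Xc - X_hat) ** 2)        # sum of squared distances
captured = (n - 1) * eigenvalues[:k].sum()     # variance kept (as a sum of squares)
discarded = (n - 1) * eigenvalues[k:].sum()    # variance thrown away
total = np.sum(Xc ** 2)

print(np.isclose(recon_error, discarded))
print(np.isclose(captured + recon_error, total))
```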

🎓 When to Use Which Interpretation?
- Use variance-based interpretation when your goal is data exploration, dimensionality reduction, or identifying important directions in the data.
- Use projection-based interpretation when you're thinking about compression, reconstruction, or low-rank approximations (e.g., in image compression or latent variable models).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import seaborn as sns

# Generate a simple 2D dataset with correlated features
np.random.seed(42)
x = np.random.normal(0, 1, 100)
y = 2 * x + np.random.normal(0, 0.5, 100)  # Strong linear relationship
data = np.vstack((x, y)).T

# Standardize the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

# Fit PCA with 2 components
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data_scaled)
data_reconstructed = pca.inverse_transform(data_pca)

# Principal component directions
components = pca.components_

data_mean = np.mean(data_scaled, axis=0)

# Plotting
fig, ax = plt.subplots(1, 2, figsize=(14, 6))

# Original data and principal component directions
ax[0].scatter(data_scaled[:, 0], data_scaled[:, 1], alpha=0.5, label='Original Data')
for length, vector in zip(pca.explained_variance_, components):
    v = vector * 2 * np.sqrt(length)
    ax[0].arrow(data_mean[0], data_mean[1], v[0], v[1],
                color='red', head_width=0.1, label='Principal Component')
ax[0].set_title("Original Data with Principal Components")
ax[0].axis('equal')
ax[0].legend()

# Projection view (reconstruction from first PC only)
pca_1d = PCA(n_components=1)
data_pca_1d = pca_1d.fit_transform(data_scaled)
data_reconstructed_1d = pca_1d.inverse_transform(data_pca_1d)

ax[1].scatter(data_scaled[:, 0], data_scaled[:, 1], alpha=0.5, label='Original Data')
ax[1].scatter(data_reconstructed_1d[:, 0], data_reconstructed_1d[:, 1], color='red', alpha=0.8, label='Reconstructed (1 PC)')
for i in range(len(data_scaled)):
    ax[1].plot([data_scaled[i, 0], data_reconstructed_1d[i, 0]],
               [data_scaled[i, 1], data_reconstructed_1d[i, 1]], 'gray', linewidth=0.5)
ax[1].set_title("Projection View (Minimizing Reconstruction Error)")
ax[1].axis('equal')
ax[1].legend()

plt.tight_layout()
plt.show()


##### Left Plot: Variance Maximization View
- The red arrows show the principal components.
- The first principal component (longer arrow) captures the most variation in the data — it's aligned with the diagonal trend of the data cloud.
- This aligns with the variance maximization perspective.

##### Right Plot: Projection & Reconstruction View
- Red points are the reconstructions using only the first principal component.
- Gray lines connect the original data points to their projections on the first PC.
- PCA chooses this line to minimize the total squared distance (reconstruction error).
- This aligns with the low-rank approximation view.


### 📌 What Is the First Principal Component (PC1)?
The first principal component (PC1) is the linear combination of the original features that:
- Captures the largest amount of variability (or variance) in the data.
- Defines the direction along which the data varies the most.
- Is a weighted sum of the original features (based on eigenvectors of the covariance matrix).

🔍 What Does "Information Content" Mean?
In PCA, "information content" refers to the amount of total variance in the dataset that is captured by a principal component.

So:

🔹 The first principal component captures the maximum possible variance that any single axis in the feature space can capture.

Mathematically:
- Let total variance = sum of variances of all principal components = trace of covariance matrix.
- The explained variance ratio of PC1:

$$ \text{ Explained Variance Ratio (PC1)} = \frac{\lambda_1}{\sum^p_{j = 1} \lambda_j}$$
- where:
    - $\lambda_1$ is the eigenvalue for PC1, and  
    - $\lambda_j$ are eigenvalues for all components.
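
The formula can be applied directly to eigenvalues. The sketch below reuses the simulated customer data from the step-by-step example (Age, Income, SpendingScore, VisitFrequency) and is meant as an illustration only; the variable names are assumptions.

```python
import numpy as np

# Simulated customer data from the step-by-step example
X = np.column_stack([
    [25, 45, 31, 35, 52, 23, 40, 60, 48, 33],                                  # Age
    [40000, 80000, 52000, 58000, 100000, 39000, 70000, 120000, 85000, 61000],  # Income
    [60, 30, 50, 40, 20, 65, 35, 15, 25, 45],                                  # SpendingScore
    [4, 2, 3, 3, 1, 5, 2, 1, 1, 3],                                            # VisitFrequency
]).astype(float)

# Standardize (z-scores), then take eigenvalues of the covariance matrix
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues = np.linalg.eigvalsh(np.cov(Xs, rowvar=False))[::-1]   # descending order

# Explained variance ratio: lambda_j divided by the sum of all eigenvalues
ratio = eigenvalues / eigenvalues.sum()

print("Explained variance ratio per PC:", np.round(ratio, 3))
print("PC1 share:", round(ratio[0], 3))
```

The ratios sum to 1, and the first entry is the share of total variance captured by PC1.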

🧠 Interpreting the Information Content
- If PC1 explains 80% of the variance, it means the spread of data along this one direction accounts for 80% of the total variance in the data.
- The features with the highest loadings (coefficients) in PC1 are the most influential in defining this primary direction of variability.
- Often, PC1 can reveal:
    - Customer segments based on behavior patterns.
    - Key driving factors (e.g., income or age) in a population.
    - Dominant relationships among correlated features.

📊 Example: Customer Demographic Data

Suppose PC1 has:
- a high loading for income (+0.7),
- a high loading for education level (+0.6), and
- loadings near zero for the other features.

Then:
- PC1 is essentially a "socioeconomic status" component, and the major variation in customer data is explained by this latent factor.

✅ Summary
|Aspect|	Description|
|------|---------------|
|PC1|	Direction of maximum variance in the data|
|Information content|	% of total variance PC1 explains|
|Loadings (weights)|	Show which original features contribute most|
|Interpretation|	Tells you what underlying factor drives the largest differences in your data|

### Effect of Zero Correlation Between the First and Second Principal Component
🔹 What Does It Mean?

In PCA, the principal components (PCs) are constructed to be:
- Orthogonal to each other (i.e., 90° apart in high-dimensional space).
- Therefore, they are uncorrelated with one another.

So when we say:
- ❝The first and second principal components have zero correlation❞
It means:
- Changes in the first component do not linearly predict changes in the second.
- They capture separate, linearly unrelated patterns of variation in the data (uncorrelated does not necessarily mean statistically independent).

Mathematically, the covariance between PC1 and PC2 is zero:
$$ Cov(PC1, PC2) = 0$$
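
This can be verified on a small synthetic example. In the sketch below (made-up data and names), two strongly correlated features are transformed into principal component scores whose covariance is zero up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=500)   # built to correlate strongly with x1
X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ eigenvectors[:, ::-1]          # reverse columns so they read PC1, PC2

corr_features = np.corrcoef(X, rowvar=False)[0, 1]   # correlation of original features
cov_scores = np.cov(scores, rowvar=False)            # covariance matrix of PC scores

print("Correlation between original features:", round(corr_features, 3))
print("Cov(PC1, PC2):", cov_scores[0, 1])
```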

##### 🔍 Why Is This Important?

No Redundant Information:
- Each principal component contributes unique variance not captured by earlier components.

Simplifies Analysis:
- You can study PC1, PC2, etc., separately without worrying about multicollinearity.

Enables Orthogonal Projection:
- Data can be projected onto these components like independent axes of variation.

##### 📊 Visual Interpretation
Imagine a PCA plot where:
- PC1 is the x-axis.
- PC2 is the y-axis.

If these components are uncorrelated, then:
- Data spread in the PC1 direction is independent of spread in the PC2 direction.
- The scatterplot appears as a cloud without diagonal structure.

🧠 Analogy
- Think of a customer segmentation PCA:
    - PC1 might represent “wealth-related variation” (income, credit score).
    - PC2 might represent “age-related variation” (age, family size).
- They capture different, independent traits:
    - High-income customers could be young or old — there's no direct relationship between PC1 and PC2.

✅ Summary Table
|Property|	Implication|
|--------|-------------|
|Zero correlation|	Each PC adds unique, non-overlapping information|
|Orthogonality|	Vectors (directions) are 90° apart|
|No multicollinearity|	PCs are uncorrelated; simplifies regression and clustering|
|Interpretation|	Helps isolate underlying dimensions of variation|

### ✅ What Does "Orthogonal" Mean in PCA?
In PCA, the principal components (PCs) are constructed such that:

🔹 Each principal component is orthogonal to (i.e., at a 90° angle from) every other component.
Mathematically, orthogonality means:

$$ PC_i \cdot PC_j = 0 \text{ for all } i \neq j $$

This is the dot product = 0, which implies the vectors are perpendicular.

### Why Is Orthogonality Important in PCA?

|Benefit	|Explanation |
|-----------|------------|
|Uncorrelated Components|	Orthogonality ensures each principal component captures a unique, independent direction of variance in the data. There is no redundancy.|
|Simplifies Interpretation|	Because there's no overlap, you can analyze one PC without worrying about its relationship to the others.|
|Avoids Multicollinearity|	Orthogonal PCs are ideal inputs for downstream models (e.g., regression), because they’re not collinear.|
|Geometric Simplicity|	Orthogonal axes allow you to decompose the data into independent geometric directions — like projecting a shadow onto a set of perpendicular lines.|

🎯 Geometric Intuition

Imagine PCA in 2D:
- PC1 = line that best fits the spread of data (max variance).
- PC2 = next best line that’s perpendicular to PC1 and captures the remaining variance.

So if PC1 runs diagonally up and to the right (↗), then PC2 must run up and to the left (↖), at 90° to PC1.
- ✅ The PCA algorithm forces this perpendicularity when solving for eigenvectors.

### 🧠 Mathematical Link
- PCA uses eigenvectors of the covariance matrix of the data.
- A symmetric matrix (like the covariance matrix) always has a full set of mutually orthogonal eigenvectors; eigenvectors belonging to distinct eigenvalues are automatically orthogonal.
- So, PCA naturally produces orthogonal components because of how the math works.

##### 💡 Big Picture First
- PCA (Principal Component Analysis) is built on linear algebra: it finds eigenvectors and eigenvalues of the covariance matrix of your data.

These eigenvectors become the principal components — the directions along which the data varies the most.

🔹 What is a Covariance Matrix?
- The covariance matrix tells you how much two features vary together.
- For a dataset $X$ with $n$ observations and $p$ features (columns), the covariance matrix $\Sigma$ is a $p \times p$ matrix where:
$$ \Sigma_{ij} = \text{Cov}(X_i, X_j) $$

So:
- The diagonal entries are the variances of each feature.
- The off-diagonal entries are the covariances between features.

👉 Example:
Suppose your features are:
- Age
- Income
- Spending Score
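
For three features like these, the covariance matrix is 3 × 3. The sketch below builds it by hand and compares it with `np.cov`. The numbers are the small synthetic Age / Income / SpendingScore table used in the eigen-decomposition example, with Income expressed in thousands here purely for readability.

```python
import numpy as np

# Columns: Age, Income (in thousands), SpendingScore
X = np.array([[25, 50, 60],
              [30, 60, 70],
              [45, 80, 45],
              [35, 72, 55],
              [23, 45, 65]], dtype=float)

n = X.shape[0]
Xc = X - X.mean(axis=0)            # center each column
Sigma = Xc.T @ Xc / (n - 1)        # sample covariance matrix (3 x 3)

print(np.round(Sigma, 2))
print("Diagonal = variances:", np.allclose(np.diag(Sigma), X.var(axis=0, ddof=1)))
print("Matches np.cov:", np.allclose(Sigma, np.cov(X, rowvar=False)))
```

The negative entry for Age versus SpendingScore shows that, in this toy table, older customers tend to have lower spending scores.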

🔹 What are Eigenvectors and Eigenvalues?
- Given a square matrix (like the covariance matrix), an eigenvector is a vector that doesn’t change direction when the matrix is applied to it — it only gets stretched or squished.
$$ A \cdot v = \lambda \cdot v $$
- Where:
    - A is your covariance matrix.
    - 𝑣 is the eigenvector.
    - $\lambda$ is the eigenvalue (a scalar).

🔑 Eigenvectors point in the direction of the data's main axes of variation.

🔑 Eigenvalues tell you how much variance is along each eigenvector.

🔹 How This Applies to PCA

Step 1: Standardize the Data
- You often start by scaling your features (z-score or min-max).

Step 2: Compute Covariance Matrix
- Get the relationships between features.

Step 3: Compute Eigenvalues & Eigenvectors
- These define the new axes (principal components).
- Each eigenvector becomes a principal component direction.
- Each eigenvalue tells how important that direction is.

Imagine you plot 2D data like this:
- PC1 is the direction with the maximum spread of data — it aligns with the first eigenvector. 
- PC2 is perpendicular and captures the next most variance.

📊 Practical Implication: Uncorrelated Features
- Even if your original features (like income, age, and education) are correlated, the principal components are not.

|Original Feature Space|	PCA-Transformed Space |
|----------------------|------------------------|
|Variables may be correlated|	Components are orthogonal (uncorrelated)|
|Complex relationships|	Clean, independent axes of variation|

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Fake data
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# 1. Standardize
X_std = StandardScaler().fit_transform(X)

# 2. Compute covariance matrix
cov_matrix = np.cov(X_std.T)

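# 3. Eigen decomposition (eigh suits symmetric matrices; sort so PC1 comes first)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]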

print("Covariance Matrix:\n", cov_matrix)
print("\nEigenvalues:\n", eigenvalues)
print("\nEigenvectors:\n", eigenvectors)

This tells us (eigenvector signs may be flipped):
- PC1 = vector [0.707, 0.707] → explains about 97% of variance
- PC2 = vector [-0.707, 0.707] → explains about 3% of variance

✅ Summary
|Concept|	Meaning|
|-------|----------|
|Covariance Matrix|	Measures how features vary together|
|Eigenvectors|	Principal directions of variance (axes of PCA)|
|Eigenvalues|	Magnitude of variance along each direction|
|Orthogonality|	PCA components are perpendicular (uncorrelated)|
|Why it matters|	Enables dimension reduction while preserving structure|

##### Visualize the eigenvectors as arrows over the data to see how PCA rotates the space

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# Original 2D dataset
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0],
              [2.3, 2.7],
              [2, 1.6],
              [1, 1.1],
              [1.5, 1.6],
              [1.1, 0.9]])

# Step 1: Standardize the data
X_std = StandardScaler().fit_transform(X)

# Step 2: Compute covariance matrix
cov_matrix = np.cov(X_std.T)

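# Step 3: Compute eigenvalues and eigenvectors
# (eigh suits symmetric matrices; sort descending so column 0 is PC1)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]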

# Plotting
plt.figure(figsize=(8, 6))
plt.scatter(X_std[:, 0], X_std[:, 1], alpha=0.6)
origin = np.mean(X_std, axis=0)  # center of data

# Plot eigenvectors
for i in range(len(eigenvectors)):
    vec = eigenvectors[:, i]
    plt.quiver(*origin, vec[0], vec[1], angles='xy', scale_units='xy', scale=1.5, color=['red', 'green'][i], label=f'PC{i+1}')

plt.axhline(0, color='gray', linewidth=0.5)
plt.axvline(0, color='gray', linewidth=0.5)
plt.gca().set_aspect('equal', adjustable='box')
plt.title('Data with Principal Component Directions')
plt.xlabel('Standardized Feature 1')
plt.ylabel('Standardized Feature 2')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()



✅ Covariance Matrix
- The covariance matrix represents the relationships between each pair of features.
- In this 2D case, it shows how Feature 1 and Feature 2 vary together.
- If the off-diagonal values are large, it suggests strong correlation.

✅ Eigenvectors of the Covariance Matrix
- The eigenvectors point in the directions of the principal components (PCs).
- These are new axes that are orthogonal (perpendicular) and aligned with the directions of maximum variance in the data.

✅ What You See in the Plot
- Red Arrow (PC1): The first principal component—points in the direction where the data varies the most. It's the major "trend" in the data.
- Green Arrow (PC2): The second principal component—perpendicular to PC1, capturing the second-most variance.

✅ Summary
|Term|	Meaning|
|------|----------|
|Eigenvector|	Direction of a principal component (axis of maximum variance)|
|Eigenvalue|	Magnitude of variance captured by its corresponding eigenvector|
|PC1|	Captures the most variance|
|PC2|	Orthogonal to PC1, captures remaining variance.|

##### Compute how much variance each PC explains

✅ Explained Variance Ratio (How much variance each principal component captures):
- PC1 explains 96.3% of the total variance.
- PC2 explains only 3.7% of the total variance.

Together, they account for 100% of the variance (since we only have 2 dimensions).

📌 Interpretation
- Almost all the information in the dataset is captured by the first principal component.
- This means the data can effectively be reduced to 1 dimension without much loss.

In [None]:
# Calculate the explained variance ratio
explained_variance_ratio = eigenvalues / np.sum(eigenvalues)

# Show explained variance of each principal component
explained_variance_ratio, np.cumsum(explained_variance_ratio)  # cumulative sum also helpful to decide how many PCs to keep


### Expected Information Content of the Second Principal Component (PC2)
When you perform PCA on a dataset with two predictors (let’s say $𝑋_1$ and $𝑋_2$):

You will always get two principal components:
- PC1 (first principal component): Captures the most variance possible in the data.
- PC2 (second principal component): Captures the remaining variance that is uncorrelated with PC1.

🔍 Key Properties of PC2

Orthogonal to PC1:
- PC2 is at a right angle (perpendicular) to PC1.
- Ensures no overlap of information between PC1 and PC2 (i.e., zero correlation).

Captures Remaining Variance:
- If the original two features are highly correlated, PC1 captures most of the variance, and PC2 will contain very little.

This is exactly what we saw in the 10-point example above:
- PC1: 96.3% of variance
- PC2: 3.7% of variance

So PC2 carries little information when the two features are strongly correlated; the weaker their correlation, the larger the share of variance that falls on PC2.
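
For two standardized features with correlation $r$, the eigenvalues of the correlation matrix are $1 + |r|$ and $1 - |r|$, so PC2's share of the variance is $(1 - |r|)/2$. The sketch below applies this to the 10-point dataset used above and should reproduce the 96.3% / 3.7% split; the variable names are illustrative.

```python
import numpy as np

# The 10-point 2D dataset from the plotting example
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

corr = np.corrcoef(X, rowvar=False)
r = corr[0, 1]

# Closed-form shares for two standardized features
pc1_share = (1 + abs(r)) / 2
pc2_share = (1 - abs(r)) / 2

# Cross-check against the eigenvalues of the correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)[::-1]
shares = eigenvalues / eigenvalues.sum()

print("r =", round(r, 4))
print("PC1 share:", round(pc1_share, 3), " PC2 share:", round(pc2_share, 3))
```

When $r$ is close to 0, both shares approach 50%, which is the case where PC2 carries as much information as PC1.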

Usefulness:
- If you want a compressed representation (e.g. for visualization or speed), you might drop PC2 entirely.
- But if small variations matter (e.g. in anomaly detection or fine classification), PC2 may still be important.

📊 Visual Intuition

Think of it like this:
- If your data is shaped like a long ellipse, PC1 lies along the major axis, PC2 along the minor axis.
- PC2 contains the tiny spread or scatter orthogonal to the main trend—it’s like measuring the "noise" or "deviation" from the main direction.

### Relationship Between Number of Principal Components and Number of Features
In Principal Component Analysis (PCA), the number of principal components (PCs) you can extract is bounded by the number of original features (variables).

✅ General Rule:

If your data has:
- p features (columns or variables),
- Then you can extract at most p principal components.

So:
- If you have 5 features → at most 5 principal components.
- If you have 10 features → at most 10 principal components.

Each principal component is a linear combination of all the original features.

✅ Why This Limit?

Mathematically:
- PCA is based on the eigen-decomposition of the covariance matrix.
    - A p×p covariance matrix has p eigenvalues (counted with multiplicity) and at most p linearly independent eigenvectors.
- Therefore, it can yield at most p principal components.

✅ Choosing Fewer Components
- Although you can extract all p components, you typically keep only a subset, say k < p, based on how much variance each component explains.

This allows you to:
- Reduce dimensionality
- Remove noise or redundant information
- Speed up models
- Improve interpretability

🧠 Intuition
|Term|	Meaning|
|-----|--------|
|Original Features (p)|	Raw variables in your dataset|
|Principal Components (≤ p)|	New axes that are combinations of the original features, ordered by variance explained|
|Retained PCs|	Often far fewer than p, chosen to capture ~95% of the total variance|
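
One common way to pick the number of retained components is the cumulative explained variance. The sketch below is an illustration on synthetic data generated from two hidden factors (all names and the 95% threshold are assumptions): it extracts all p eigenvalues and finds the smallest k whose cumulative share reaches 95%.

```python
import numpy as np

rng = np.random.default_rng(5)
latent = rng.normal(size=(200, 2))                        # 2 hidden factors
X = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(200, 6))  # 6 observed features

Xs = (X - X.mean(axis=0)) / X.std(axis=0)                 # standardize
eigenvalues = np.linalg.eigvalsh(np.cov(Xs, rowvar=False))[::-1]   # p eigenvalues, descending

ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratio)

# Smallest k such that the first k components explain at least 95% of the variance
k = int(np.searchsorted(cumulative, 0.95) + 1)

print("Number of features p:", X.shape[1])
print("Cumulative explained variance:", np.round(cumulative, 3))
print("Components kept for 95%:", k)
```

scikit-learn offers a comparable shortcut: passing a float between 0 and 1 as `n_components` to `PCA` (with `svd_solver='full'`) keeps just enough components to reach that share of variance.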

# Decision Trees and Random Forests
What It Means: 
- Decision trees split data based on conditions, creating branches that lead to a prediction. 
- Random forests use multiple trees to improve accuracy and reduce overfitting.

Outcome Interpretation: 
- Each "branch" shows how different conditions affect the outcome.
- Random forests average the results of many trees (or take a majority vote for classification) for robust predictions.

Performance Measures:
- Accuracy: Proportion of correctly classified samples.
- Gini Index / Entropy: Measure the impurity of a node; lower values after a split mean purer child nodes.
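
To make the impurity measures concrete, here is a minimal sketch of both formulas for the class labels in a single node. The helper names `gini` and `entropy` are hypothetical and not taken from any library.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure_node = [1, 1, 1, 1]      # only one class present
mixed_node = [0, 0, 1, 1]     # 50/50 split between two classes

print("Pure node  -> gini:", gini(pure_node), "entropy:", abs(entropy(pure_node)))
print("Mixed node -> gini:", gini(mixed_node), "entropy:", entropy(mixed_node))
```

A pure node scores 0 on both measures, while a 50/50 binary node reaches the maximum (0.5 for Gini, 1 bit for entropy). A tree chooses the split that lowers the weighted impurity of the child nodes the most.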

Lay Explanation: 
- Decision trees are like flowcharts that guide predictions based on conditions. 
- Random forests combine many trees to make stronger, more reliable decisions.

Use Case: 
- For classification or regression problems with non-linear relationships and high dimensionality.

# Survival Analysis (e.g., Cox Proportional Hazards)
What It Means: 
- Survival analysis predicts the time until an event occurs, such as customer churn or equipment failure.

Outcome Interpretation: 
- Each output shows the likelihood of the event happening over time, considering various risk factors.

Performance Measures:
- Concordance Index (C-Index): Measures the model’s ability to correctly rank predictions; values closer to 1 indicate better performance.
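
The idea behind the C-index can be sketched in a few lines. This is a simplified version written for illustration (it ignores tied event times and is not the implementation used by any particular library): among all comparable pairs, it counts how often the subject who experienced the event earlier was also given the higher risk score.

```python
def concordance_index(times, risk_scores, events):
    """Simplified C-index: fraction of comparable pairs ranked correctly.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] == 1) strictly before time j. Ties in risk count as 0.5.
    """
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

times = [2, 4, 6, 8]                  # time to event or censoring
events = [1, 1, 0, 1]                 # 1 = event observed, 0 = censored
good_scores = [0.9, 0.7, 0.5, 0.1]    # higher risk for earlier events
bad_scores = [0.1, 0.5, 0.7, 0.9]     # ranking exactly reversed
flat_scores = [0.5, 0.5, 0.5, 0.5]    # no discrimination at all

print(concordance_index(times, good_scores, events))
print(concordance_index(times, bad_scores, events))
print(concordance_index(times, flat_scores, events))
```

A perfect ranking gives 1, a fully reversed ranking gives 0, and a model that cannot discriminate gives 0.5, which is why 0.5 is the usual "no better than chance" baseline.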

Lay Explanation: 
- Survival analysis is like tracking how long something will last, based on factors that might speed it up or slow it down.

Use Case: 
- For time-to-event data, such as time until a customer churns or equipment fails.

Model Types: 
- Kaplan-Meier estimator, Cox Proportional Hazards Model.

In [None]:
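from lifelines import CoxPHFitter

# `data` is assumed to be a pandas DataFrame with a duration column 'time',
# an event indicator column 'event' (1 = event observed, 0 = censored),
# and one column per covariate.
cph = CoxPHFitter()
cph.fit(data, duration_col='time', event_col='event')
cph.predict_survival_function(data)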

# Time Series Models (e.g., ARIMA)
What It Means: 
- Time series models account for:
    - trends, 
    - seasonality, and 
    - temporal dependencies in data collected over time, often used for forecasting future values.

Outcome Interpretation: 
- Each prediction is based on patterns in past data points, accounting for recent trends and cycles.

Performance Measures:
- Mean Absolute Percentage Error (MAPE): Shows the average prediction error in percentage terms.
- Root Mean Squared Error (RMSE): Measures the prediction accuracy; lower values mean better predictions.
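
Both measures are simple to compute by hand. The sketch below uses three made-up actual and forecast values purely for illustration.

```python
import numpy as np

actual = np.array([100.0, 200.0, 400.0])     # observed values (made up)
forecast = np.array([110.0, 180.0, 400.0])   # model predictions (made up)

errors = actual - forecast

# MAPE: average absolute error relative to the actual value, in percent
mape = np.mean(np.abs(errors / actual)) * 100

# RMSE: square root of the average squared error, in the units of the data
rmse = np.sqrt(np.mean(errors ** 2))

print("MAPE (%):", round(mape, 2))
print("RMSE:", round(rmse, 2))
```

MAPE is undefined when an actual value is zero, while RMSE penalizes large errors more heavily because the errors are squared before averaging.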

Lay Explanation: 
- Time series models are like weather forecasts—they predict future values based on past patterns, like trends and cycles.

Use Case: 
- Forecasting for data with a temporal component (e.g., sales data, stock prices).

Model Types: 
- ARIMA, 
- SARIMA, 
- Exponential Smoothing.

In [None]:
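from statsmodels.tsa.arima.model import ARIMA

# `time_series_data` is assumed to be a 1-D series of observations ordered in time
model = ARIMA(time_series_data, order=(1, 1, 1))  # (p, d, q) = (AR order, differencing, MA order)
model_fit = model.fit()
predictions = model_fit.forecast(steps=10)        # forecast the next 10 periods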

# Metrics

In [None]:
# Functions to compute True Positives, True Negatives, False Positives and False Negatives

def true_positive(y_true, y_pred):
    tp = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 1 and yp == 1:
            tp += 1
    return tp

def true_negative(y_true, y_pred):
    tn = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 0:
            tn += 1        
    return tn

def false_positive(y_true, y_pred):
    fp = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 1:
            fp += 1       
    return fp

def false_negative(y_true, y_pred):
    fn = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 1 and yp == 0:
            fn += 1        
    return fn

In [None]:
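import numpy as np

# `cnf_matrix` is assumed to be a square confusion matrix
# (rows = true class, columns = predicted class), e.g. from sklearn.metrics.confusion_matrix
FP = cnf_matrix.sum(axis=0) - np.diag(cnf_matrix) 
FN = cnf_matrix.sum(axis=1) - np.diag(cnf_matrix)
TP = np.diag(cnf_matrix)
TN = cnf_matrix.sum() - (FP + FN + TP)

FP = FP.astype(float)
FN = FN.astype(float)
TP = TP.astype(float)
TN = TN.astype(float)

# Sensitivity, hit rate, recall, or true positive rate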
TPR = TP/(TP+FN)
# Specificity or true negative rate
TNR = TN/(TN+FP) 
# Precision or positive predictive value
PPV = TP/(TP+FP)
# Negative predictive value
NPV = TN/(TN+FN)
# Fall out or false positive rate
FPR = FP/(FP+TN)
# False negative rate
FNR = FN/(TP+FN)
# False discovery rate
FDR = FP/(TP+FP)
# Overall accuracy for each class
ACC = (TP+TN)/(TP+FP+FN+TN)
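
To make the per-class arithmetic concrete, here is a small worked example on a made-up 3-class confusion matrix. It is a sketch under the assumption that rows hold the true classes and columns the predicted classes; the matrix `cm_demo` is hypothetical.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class
cm_demo = np.array([[5, 1, 0],
                    [2, 6, 1],
                    [0, 2, 7]])

tp_demo = np.diag(cm_demo)                     # correctly predicted, per class
fp_demo = cm_demo.sum(axis=0) - tp_demo        # predicted as the class but belonging to another
fn_demo = cm_demo.sum(axis=1) - tp_demo        # belonging to the class but predicted as another
tn_demo = cm_demo.sum() - (tp_demo + fp_demo + fn_demo)  # everything else

recall_demo = tp_demo / (tp_demo + fn_demo)
precision_demo = tp_demo / (tp_demo + fp_demo)

print("TP:", tp_demo, "FP:", fp_demo, "FN:", fn_demo, "TN:", tn_demo)
print("Recall per class:   ", np.round(recall_demo, 3))
print("Precision per class:", np.round(precision_demo, 3))
```

Each class is treated as "positive" in turn and all other classes as "negative", which is why every quantity comes out as a vector with one entry per class.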

In [None]:
# Implementation for a table of binary-classification metrics:
import math
import pandas as pd
import sklearn.metrics
from sklearn.metrics import confusion_matrix

def matrix_metrix(real_values, pred_values, beta):
    # sklearn convention: rows are true labels, columns are predicted labels
    CM = confusion_matrix(real_values, pred_values)
    TN = CM[0][0]
    FN = CM[1][0]
    TP = CM[1][1]
    FP = CM[0][1]
    Population = TN+FN+TP+FP
    # Prevalence is the share of actual positives in the population
    Prevalence = round( (TP+FN) / Population,2)
    Accuracy   = round( (TP+TN) / Population,4)
    Precision  = round( TP / (TP+FP),4 )
    NPV        = round( TN / (TN+FN),4 )
    FDR        = round( FP / (TP+FP),4 )
    FOR        = round( FN / (TN+FN),4 )
    # Sanity checks: each pair should sum to 1
    check_Pos  = Precision + FDR
    check_Neg  = NPV + FOR
    Recall     = round( TP / (TP+FN),4 )
    FPR        = round( FP / (TN+FP),4 )
    FNR        = round( FN / (TP+FN),4 )
    TNR        = round( TN / (TN+FP),4 )
    check_Pos2 = Recall + FNR
    check_Neg2 = FPR + TNR
    # Likelihood ratios and diagnostic odds ratio
    LRPos      = round( Recall/FPR,4 )
    LRNeg      = round( FNR / TNR ,4 )
    DOR        = round( LRPos/LRNeg)
    F1         = round ( 2 * ((Precision*Recall)/(Precision+Recall)),4)
    FBeta      = round ( (1+beta**2)*((Precision*Recall)/((beta**2 * Precision)+ Recall)) ,4)
    MCC        = round ( ((TP*TN)-(FP*FN))/math.sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))  ,4)
    # Bookmaker informedness and markedness
    BM         = Recall+TNR-1
    MK         = Precision+NPV-1

    mat_met = pd.DataFrame({'Metric':['TP','TN','FP','FN','Prevalence','Accuracy','Precision','NPV','FDR','FOR','check_Pos','check_Neg','Recall','FPR','FNR','TNR','check_Pos2','check_Neg2','LR+','LR-','DOR','F1','FBeta','MCC','BM','MK'],
                            'Value':[TP,TN,FP,FN,Prevalence,Accuracy,Precision,NPV,FDR,FOR,check_Pos,check_Neg,Recall,FPR,FNR,TNR,check_Pos2,check_Neg2,LRPos,LRNeg,DOR,F1,FBeta,MCC,BM,MK]})

    return mat_met

In [None]:
# ROC Implementation

from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_curve
from matplotlib import pyplot

# prob_values: predicted probabilities (or scores) for the positive class
fpr, tpr, thresholds = roc_curve(real_values, prob_values)

auc = roc_auc_score(real_values, prob_values)
print('AUC: %.3f' % auc)

pyplot.plot(fpr, tpr, linestyle='--', label='Roc curve')
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
pyplot.show()

# Precision-recall implementation

precision, recall, thresholds = precision_recall_curve(real_values, prob_values)

pyplot.plot(recall, precision, linestyle='--', label='Precision versus Recall')
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
pyplot.legend()
pyplot.show()

In [None]:
# Function to get many metrics directly from sklearn
import pandas as pd
import sklearn.metrics

def sk_metrix(real_values, pred_values, beta):
    Accuracy = round( sklearn.metrics.accuracy_score(real_values,pred_values) ,4)
    Precision= round( sklearn.metrics.precision_score(real_values,pred_values),4 )
    Recall   = round( sklearn.metrics.recall_score(real_values,pred_values),4 )
    F1       = round ( sklearn.metrics.f1_score(real_values,pred_values),4)
    # beta must be passed as a keyword argument
    FBeta    = round ( sklearn.metrics.fbeta_score(real_values,pred_values,beta=beta) ,4)
    MCC      = round ( sklearn.metrics.matthews_corrcoef(real_values,pred_values)  ,4)
    Hamming  = round ( sklearn.metrics.hamming_loss(real_values,pred_values) ,4)
    Jaccard  = round ( sklearn.metrics.jaccard_score(real_values,pred_values) ,4)
    # average_precision_score normally expects scores or probabilities;
    # with hard labels it summarizes a single operating point
    Prec_Avg = round ( sklearn.metrics.average_precision_score(real_values,pred_values) ,4)
    Accu_Avg = round ( sklearn.metrics.balanced_accuracy_score(real_values,pred_values) ,4)

    mat_met = pd.DataFrame({
        'Metric': ['Accuracy','Precision','Recall','F1','FBeta','MCC','Hamming','Jaccard','Precision_Avg','Accuracy_Avg'],
        'Value': [Accuracy,Precision,Recall,F1,FBeta,MCC,Hamming,Jaccard,Prec_Avg,Accu_Avg]})

    return mat_met


In [None]:
# Evaluation Metrics For Multi-class Classification

def accuracy(y_true, y_pred):
    
    """
    Function to calculate accuracy
    -> param y_true: list of true values
    -> param y_pred: list of predicted values
    -> return: accuracy score
    
    """
    
    # Initialize the count of correctly predicted instances
    correct_predictions = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == yp:
            correct_predictions += 1
    # Return the fraction of correct predictions
    return correct_predictions / len(y_true)

In [None]:
#Computation of macro-averaged precision

def macro_precision(y_true, y_pred):

    # find the unique classes and count them
    # (np.unique works for lists, NumPy arrays and pandas Series alike)
    classes = np.unique(y_true)
    num_classes = len(classes)

    # initialize precision to 0
    precision = 0
    
    # loop over all classes
    for class_ in classes:
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        
        # compute true positive for current class
        tp = true_positive(temp_true, temp_pred)
        
        # compute false positive for current class
        fp = false_positive(temp_true, temp_pred)
        
        
        # compute precision for current class
        temp_precision = tp / (tp + fp + 1e-6)
        # keep adding precision for all classes
        precision += temp_precision
        
    # calculate and return average precision over all classes
    precision /= num_classes
    
    return precision

print(f"Macro-averaged Precision score : {macro_precision(y_test, y_pred) }")

# implement macro-averaged precision using sklearn
from sklearn import metrics
macro_averaged_precision = metrics.precision_score(y_test, y_pred, average = 'macro')
print(f"Macro-Averaged Precision score using sklearn library : {macro_averaged_precision}")

In [None]:
#Computation of micro-averaged precision

def micro_precision(y_true, y_pred):


    # find the unique classes
    # (np.unique works for lists, NumPy arrays and pandas Series alike)
    classes = np.unique(y_true)
    
    # initialize tp and fp to 0
    tp = 0
    fp = 0
    
    # loop over all classes
    for class_ in classes:
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        # calculate true positive for current class
        # and update overall tp
        tp += true_positive(temp_true, temp_pred)
        
        # calculate false positive for current class
        # and update overall fp
        fp += false_positive(temp_true, temp_pred)
        
    # calculate and return overall precision
    precision = tp / (tp + fp)
    return precision

print(f"Micro-averaged Precision score : {micro_precision(y_test, y_pred)}")


# implement micro-averaged precision using sklearn
micro_averaged_precision = metrics.precision_score(y_test, y_pred, average = 'micro')
print(f"Micro-Averaged Precision score using sklearn library : {micro_averaged_precision}")

In [None]:
#Computation of macro-averaged recall

def macro_recall(y_true, y_pred):

    # find the unique classes and count them
    # (np.unique works for lists, NumPy arrays and pandas Series alike)
    classes = np.unique(y_true)
    num_classes = len(classes)

    # initialize recall to 0
    recall = 0
    
    # loop over all classes
    for class_ in classes:
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        
        # compute true positive for current class
        tp = true_positive(temp_true, temp_pred)
        
        # compute false negative for current class
        fn = false_negative(temp_true, temp_pred)
        
        
        # compute recall for current class
        temp_recall = tp / (tp + fn + 1e-6)
        
        # keep adding recall for all classes
        recall += temp_recall
        
    # calculate and return average recall over all classes
    recall /= num_classes
    
    return recall

print(f"Macro-averaged recall score : {macro_recall(y_test, y_pred)}")


# implement macro-averaged recall using sklearn

macro_averaged_recall = metrics.recall_score(y_test, y_pred, average = 'macro')
print(f"Macro-averaged recall score using sklearn : {macro_averaged_recall}")


In [None]:
#Computation of micro-averaged recall

def micro_recall(y_true, y_pred):


    # find the unique classes
    # (np.unique works for lists, NumPy arrays and pandas Series alike)
    classes = np.unique(y_true)
    
    # initialize tp and fn to 0
    tp = 0
    fn = 0
    
    # loop over all classes
    for class_ in classes:
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        # calculate true positive for current class
        # and update overall tp
        tp += true_positive(temp_true, temp_pred)
        
        # calculate false negative for current class
        # and update overall fn
        fn += false_negative(temp_true, temp_pred)
        
    # calculate and return overall recall
    recall = tp / (tp + fn)
    return recall

print(f"Micro-averaged recall score : {micro_recall(y_test, y_pred)}")


#  implement micro-averaged recall using sklearn

micro_averaged_recall = metrics.recall_score(y_test, y_pred, average = 'micro')
print(f"Micro-Averaged recall score using sklearn library : {micro_averaged_recall}")

In [None]:
#Computation of macro-averaged f1 score

def macro_f1(y_true, y_pred):

    # find the unique classes and count them
    # (np.unique works for lists, NumPy arrays and pandas Series alike)
    classes = np.unique(y_true)
    num_classes = len(classes)

    # initialize f1 to 0
    f1 = 0
    
    # loop over all classes
    for class_ in classes:
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        
        # compute true positive for current class
        tp = true_positive(temp_true, temp_pred)
        
        # compute false negative for current class
        fn = false_negative(temp_true, temp_pred)
        
        # compute false positive for current class
        fp = false_positive(temp_true, temp_pred)
        
        
        # compute recall for current class
        temp_recall = tp / (tp + fn + 1e-6)
        
        # compute precision for current class
        temp_precision = tp / (tp + fp + 1e-6)
        
        
        temp_f1 = 2 * temp_precision * temp_recall / (temp_precision + temp_recall + 1e-6)
        
        # keep adding f1 score for all classes
        f1 += temp_f1
        
    # calculate and return average f1 score over all classes
    f1 /= num_classes
    
    return f1


print(f"Macro-averaged f1 score : {macro_f1(y_test, y_pred)}")


# implement macro-averaged F1 score using sklearn

macro_averaged_f1 = metrics.f1_score(y_test, y_pred, average = 'macro')
print(f"Macro-Averaged F1 score using sklearn library : {macro_averaged_f1}")

In [None]:
#Computation of micro-averaged f1 score

def micro_f1(y_true, y_pred):


    #micro-averaged precision score
    P = micro_precision(y_true, y_pred)

    #micro-averaged recall score
    R = micro_recall(y_true, y_pred)

    #micro averaged f1 score
    f1 = 2*P*R / (P + R)    

    return f1

print(f"Micro-averaged f1 score : {micro_f1(y_test, y_pred)}")


# implement micro-averaged F1 score using sklearn

micro_averaged_f1 = metrics.f1_score(y_test, y_pred, average = 'micro')
print(f"Micro-Averaged F1 score using sklearn library : {micro_averaged_f1}")
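
One consequence of micro-averaging is worth checking numerically: when every instance has exactly one true label and one predicted label, each misclassification counts once as a false positive (for the predicted class) and once as a false negative (for the true class). Micro-averaged precision, recall and F1 therefore all equal plain accuracy. The sketch below illustrates this; `micro_scores` and the label lists are hypothetical.

```python
def micro_scores(y_true, y_pred):
    """Return micro-averaged (precision, recall, f1) for single-label multiclass data."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for yt, yp in zip(y_true, y_pred):
            if yp == c and yt == c:
                tp += 1      # predicted c and it is c
            elif yp == c and yt != c:
                fp += 1      # predicted c but it is another class
            elif yp != c and yt == c:
                fn += 1      # it is c but another class was predicted
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up labels for three classes
y_true_mc = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred_mc = [0, 2, 2, 1, 1, 0, 2, 0]

accuracy_mc = sum(yt == yp for yt, yp in zip(y_true_mc, y_pred_mc)) / len(y_true_mc)
p_mc, r_mc, f1_mc = micro_scores(y_true_mc, y_pred_mc)
print(accuracy_mc, p_mc, r_mc, f1_mc)
```

Macro-averaged scores do not share this property, because they weight every class equally regardless of how many instances it has.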


In [None]:
# ROC AUC computation (one-vs-rest, per class)

from sklearn.metrics import roc_auc_score

def roc_auc_score_multiclass(actual_class, pred_class, average = "macro"):
    
    #creating a set of all the unique classes using the actual class list
    unique_class = set(actual_class)
    roc_auc_dict = {}
    for per_class in unique_class:
        
        #creating a list of all the classes except the current class 
        other_class = [x for x in unique_class if x != per_class]

        #marking the current class as 1 and all other classes as 0
        new_actual_class = [0 if x in other_class else 1 for x in actual_class]
        new_pred_class = [0 if x in other_class else 1 for x in pred_class]

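        #using the sklearn metrics method to calculate the roc_auc_score
        #note: pred_class holds hard labels, so each curve has a single
        #operating point and the result equals (TPR + TNR) / 2 for that class
        roc_auc = roc_auc_score(new_actual_class, new_pred_class, average = average)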
        roc_auc_dict[per_class] = roc_auc

    return roc_auc_dict

roc_auc_dict = roc_auc_score_multiclass(y_test, y_pred)
roc_auc_dict
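
The function above feeds hard class labels into `roc_auc_score`. AUC is more informative when it is computed from continuous scores, because it then measures how well the model ranks positives above negatives. The sketch below uses the pairwise (rank-based) definition of AUC to show the difference; `pairwise_auc`, the labels and the scores are all hypothetical and only meant as an illustration.

```python
def pairwise_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are assumed to be coded as 0/1.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up binary labels and model scores
y_true_auc = [1, 1, 1, 0, 0, 0]
scores_auc = [0.9, 0.6, 0.4, 0.5, 0.3, 0.1]

# Hard labels obtained by thresholding the same scores at 0.5
hard_auc = [1 if s >= 0.5 else 0 for s in scores_auc]

auc_from_scores = pairwise_auc(y_true_auc, scores_auc)
auc_from_labels = pairwise_auc(y_true_auc, hard_auc)
print(f"AUC from scores:      {auc_from_scores:.3f}")
print(f"AUC from hard labels: {auc_from_labels:.3f}")
```

Thresholding discards the ranking information inside each predicted group, so the label-based value reduces to the average of the true positive rate and the true negative rate at that one threshold.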

In [None]:
# ROC implementation: 

import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from itertools import cycle
plt.style.use('ggplot')

# Load the iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y_bin = label_binarize(y, classes=[0, 1, 2])
n_classes = y_bin.shape[1]

# We split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_bin, test_size= 0.5, random_state=0)


# We define the model as an SVC in a OneVsRestClassifier setting.
# This means one binary classifier is fitted per class: class 1 vs the rest,
# class 2 vs the rest and class 3 vs the rest.
# So we have 3 cases at the end, and within each case the decision threshold
# is varied in order to get the ROC curve of that case: 3 ROC curves as output.

classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True, random_state=0))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
# Plotting and estimation of FPR, TPR
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
colors = cycle(['blue', 'red', 'green'])

# Draw all class curves on one figure, then format and show it once
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=1.5,
             label='ROC curve of class {0} (area = {1:0.2f})'.format(i+1, roc_auc[i]))

# Diagonal reference line for a random classifier
plt.plot([0, 1], [0, 1], 'k-', lw=1.5)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()