<a href="https://colab.research.google.com/github/sethkipsangmutuba/Artificial-Intelligence/blob/main/Introduction_to_AI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **1. Understanding Artificial Intelligence (AI)**
Artificial Intelligence (AI) is the science of creating systems that can perform tasks requiring human-like intelligence, such as **learning, reasoning, problem-solving, perception, and natural language processing**.

AI is deeply embedded in our daily activities, including:
- **Internet searches** (Google’s ranking algorithms)
- **Face recognition** (smartphones, security systems)
- **Speech-to-text conversion** (Siri, Google Assistant)
- **Recommendation systems** (Netflix, Amazon, YouTube)
- **Autonomous Vehicles** (self-driving cars)

### **1.1 Why Study AI?**
- AI **automates complex decision-making** and enhances efficiency.
- It **extracts insights and predictions** from large datasets for business use.
- AI drives **technological advancements** in robotics, healthcare, and finance.
- Understanding AI allows us to **build smarter applications** to solve real-world problems.

---

## **2. AI in Data Science & Industry Applications**
AI plays a significant role across various industries:
- **Healthcare**: AI-assisted diagnostics, robotic surgeries, predictive analytics.
- **Finance**: Fraud detection, stock market predictions, risk assessment.
- **Retail & E-Commerce**: Personalized recommendations, automated chatbots.
- **Autonomous Systems**: Self-driving cars, AI-powered drones, robotic automation.
- **Cybersecurity**: AI-powered threat detection and prevention.

---

## **3. Python for AI: Essential Libraries & Setup**
### **3.1 Python Programming for AI**
Python is the most widely used language for AI and ML due to its simplicity and vast ecosystem of libraries.

### **3.2 Setting Up Your AI Development Environment**
- Install **Python 3**: [Download Here](https://www.python.org/)
- Use **virtual environments** (`venv` or `conda`) to isolate each project, and `pip` or `conda` to install packages.
- Install **Jupyter Notebook** for an interactive coding environment.
- Set up **GitHub** for version control and project collaboration.

### **3.3 Essential Python Libraries for AI**
| **Library** | **Purpose** |
|------------|-------------|
| **NumPy** | Numerical computing, matrix operations |
| **Pandas** | Data preprocessing & manipulation |
| **Matplotlib & Seaborn** | Data visualization & exploratory analysis |
| **Scikit-Learn** | Machine learning algorithms & models |
| **TensorFlow/PyTorch** | Deep learning frameworks |
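
A quick way to confirm the environment is ready is to ask Python whether it can locate each library. The sketch below is one possible check, not part of the original setup steps; the `REQUIRED` list and the `check_installed` helper are illustrative names, and deep learning frameworks are left out because they are usually installed separately.

```python
import importlib.util

# Import names for the core libraries in the table above.
# Note that scikit-learn is imported as `sklearn`.
REQUIRED = ["numpy", "pandas", "matplotlib", "seaborn", "sklearn"]

def check_installed(names):
    """Map each import name to True if Python can locate it, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_installed(REQUIRED)
for name, ok in status.items():
    print(f"{name:12s} {'installed' if ok else 'missing'}")
```

Any library reported as missing can then be installed inside the active virtual environment.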

---

## **4. Mathematical Foundations for AI**
Mathematics is the backbone of AI, enabling model development and optimization.

### **4.1 Linear Algebra for AI**
- **Vectors & Matrices**: Used to store and manipulate data in ML models.
- **Eigenvalues & Eigenvectors**: Used in **Principal Component Analysis (PCA)**.
- **Matrix Operations**: Transposition, inversion, determinants in ML.
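
To make the link between eigenvectors and PCA concrete, here is a minimal sketch that finds the first principal component of a tiny dataset by eigendecomposition of its covariance matrix. The data values are made up for illustration; in practice a library routine such as scikit-learn's `PCA` would normally be used instead.

```python
import numpy as np

# Toy data: 5 samples, 2 strongly correlated features (made-up values).
X_toy = np.array([[2.0, 1.9],
                  [0.5, 0.7],
                  [1.5, 1.6],
                  [3.0, 3.1],
                  [1.0, 0.8]])

# Center each feature, then form the sample covariance matrix.
X_centered = X_toy - X_toy.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# eigh is intended for symmetric matrices; eigenvalues are returned in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# The eigenvector with the largest eigenvalue is the first principal component.
first_pc = eigenvectors[:, -1]
explained_ratio = eigenvalues[-1] / eigenvalues.sum()

# Project the centered data onto the first principal component (2-D -> 1-D).
projected = X_centered @ first_pc
print("Share of variance captured by PC1:", round(float(explained_ratio), 3))
```

Because the two features move together, almost all of the variance lies along a single direction, which is why one component is enough here.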

### **4.2 Probability & Statistics in AI**
- **Bayes' Theorem**: Used in spam detection and predictive modeling.
- **Gaussian Distributions**: Common in data modeling and anomaly detection.
- **Statistical Inference**: Hypothesis testing and confidence intervals.
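
As a small worked example of Bayes' theorem in a spam-detection setting, the sketch below computes the probability that a message is spam given that it contains a particular word. All probabilities are hypothetical numbers chosen for illustration, and `posterior_spam` is an illustrative helper name.

```python
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """Return P(spam | word) using Bayes' theorem."""
    p_ham = 1.0 - p_spam
    # Law of total probability: chance of seeing the word in any message.
    p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / p_word

# Hypothetical inputs: the word appears in 60% of spam, 5% of non-spam,
# and 20% of all messages are spam.
p = posterior_spam(p_word_given_spam=0.6, p_word_given_ham=0.05, p_spam=0.2)
print(f"P(spam | word) = {p:.3f}")
```

With these numbers the posterior is 0.12 / 0.16 = 0.75, so a word that is only moderately common in spam can still be strong evidence when it is rare in legitimate mail.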

### **4.3 Optimization Techniques for AI Models**
Optimization helps AI models improve their predictions.
- **Gradient Descent**: Algorithm for finding optimal parameters in ML.
- **Adam, RMSprop**: Advanced optimizers for deep learning.
- **Loss Functions**: Mean Squared Error (MSE), Cross-Entropy.
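
The following sketch shows plain gradient descent minimizing Mean Squared Error for a one-feature linear model. The data are synthetic (generated from y = 3x + 2 with no noise), and the learning rate and iteration count are assumptions picked so that this small example converges.

```python
import numpy as np

# Synthetic data generated from y = 3x + 2 (no noise, for illustration).
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 2.0

w, b = 0.0, 0.0        # parameters to learn
learning_rate = 0.5    # assumed step size

for _ in range(2000):
    error = (w * x + b) - y
    # Gradients of MSE = mean(error**2) with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    # Step against the gradient to reduce the loss.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_mse = np.mean(((w * x + b) - y) ** 2)
print(f"w={w:.4f}, b={b:.4f}, MSE={final_mse:.2e}")
```

Optimizers such as Adam and RMSprop follow the same loop but adapt the step size per parameter.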

### **4.4 Vector Calculus for AI**
- **Derivatives & Gradients**: Core of optimization algorithms.
- **Backpropagation**: Key mechanism in training neural networks.
- **Hessian & Jacobian Matrices**: Higher-order optimization in AI.
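
One practical use of derivatives is checking a hand-derived gradient against a numerical approximation, a common sanity test when implementing backpropagation. The sketch below does this for a simple two-variable quadratic; the function and the helper names are illustrative choices.

```python
import numpy as np

def loss(w):
    """A simple quadratic: f(w) = w0^2 + 3*w0*w1 + 2*w1^2."""
    return w[0] ** 2 + 3.0 * w[0] * w[1] + 2.0 * w[1] ** 2

def analytic_gradient(w):
    """Gradient derived by hand: [2*w0 + 3*w1, 3*w0 + 4*w1]."""
    return np.array([2.0 * w[0] + 3.0 * w[1], 3.0 * w[0] + 4.0 * w[1]])

def numerical_gradient(f, w, eps=1e-6):
    """Central-difference approximation of the gradient of f at w."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return grad

w0 = np.array([1.0, -2.0])
print("analytic :", analytic_gradient(w0))
print("numerical:", numerical_gradient(loss, w0))
```

For this function the Hessian is the constant matrix [[2, 3], [3, 4]], which is simply the Jacobian of the gradient.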

---

## **5. Capstone Project Kickoff: Defining a Real-World AI Problem**

#### **Submit by Sunday, 2/03/2025**
A **capstone project** will help apply AI concepts to real-world problems.

### **5.1 Project Planning**
- **Define the problem statement**: Identify a relevant AI challenge.
- **Gather and preprocess data**: Use Python libraries like Pandas & NumPy.
- **Set evaluation metrics**: Define success criteria for the project.

### **5.2 Exploratory Data Analysis (EDA)**
- **Data visualization**: Use Matplotlib & Seaborn to explore patterns.
- **Feature selection**: Choose relevant data features for AI models.
- **Data cleaning**: Handle missing values, outliers, and normalization.
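
The three data-cleaning steps can be sketched on a tiny made-up table. The column names, the median fill, and the 1.5 × IQR clipping rule are illustrative assumptions; the right choices depend on the dataset.

```python
import numpy as np
import pandas as pd

# Tiny made-up dataset with one missing value and one obvious outlier.
df_demo = pd.DataFrame({
    "income": [3.2, 4.1, np.nan, 3.8, 50.0],
    "rooms": [5.0, 6.0, 5.5, 4.5, 6.5],
})

# 1. Missing values: fill with the column median.
df_demo["income"] = df_demo["income"].fillna(df_demo["income"].median())

# 2. Outliers: clip values outside 1.5 * IQR of the quartiles.
q1, q3 = df_demo["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df_demo["income"] = df_demo["income"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 3. Normalization: z-score each column (mean 0, standard deviation 1).
df_scaled = (df_demo - df_demo.mean()) / df_demo.std()
print(df_scaled.round(2))
```

With so few rows the clipping bounds are tight, so treat the thresholds here as a demonstration of the mechanics rather than a recommended setting.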

---

## **6. Hands-on Exercises**
### **6.1 Setting Up Python & GitHub**
- Install required Python libraries.
- Set up Jupyter Notebook for coding.
- Upload and manage projects on GitHub.

### **6.2 Implementing Basic AI Models**
#### **Linear Regression using NumPy & Scikit-Learn**
- Train a simple regression model.
- Predict outcomes and evaluate model performance.

#### **Simple Classification Model with TensorFlow**
- Implement a basic neural network classifier.
- Train on labeled datasets and visualize results.

### **6.3 Visualizing AI Models**
- Plot loss curves to track training progress.
- Generate decision boundaries in classification models.

---

## **7. Week 1 Outcome**
By the end of this week, you will:
- Understand AI’s role in different industries.  
- Set up Python, Jupyter Notebook, and GitHub for AI development.  
- Gain mathematical knowledge essential for AI models.  
- Implement basic AI models using Python & TensorFlow.  
- Define and plan a real-world AI capstone project.  


# **What is Artificial Intelligence?**  

Artificial Intelligence (AI) enables machines to **think, learn, and act intelligently**. It involves developing **software** that mimics **human reasoning, decision-making, and problem-solving**.  

### **Key Aspects of AI**  
AI focuses on making machines:  
- **Sense** their environment  
- **Reason** based on data  
- **Think** like humans  
- **Act** rationally  

### **AI & Human Intelligence**  
AI is inspired by the **human brain**, aiming to replicate how it **learns and makes decisions**.  
By understanding human cognition, researchers create **intelligent systems** capable of **autonomous learning and adaptation**.  


# **Why Do We Need to Study AI?**  

AI affects **many areas of daily life** by recognizing patterns, automating tasks, and supporting decisions.  

## **AI vs. Human Intelligence**  

| **Human Brain**                           | **Artificial Intelligence**                   |
|-------------------------------------------|----------------------------------------------|
| Recognizes objects, understands language | Learns from data, mimics human thinking     |
| Processes information effortlessly        | Requires algorithms & computations          |
| Adapts to new situations                  | Continuously improves through training      |
| Limited by biological constraints         | Scales with available computational power     |

## **Challenges in the Modern World**  

| **Challenges**                          | **How AI Solves Them**                          |
|-----------------------------------------|-----------------------------------------------|
| **Massive, unstructured data**         | Processes large datasets efficiently        |
| **Multiple data sources**               | Ingests real-time data simultaneously       |
| **Constantly changing information**    | Learns & updates continuously               |
| **Need for real-time decisions**       | Responds instantly with high precision      |

## **AI's Role in Automation & Efficiency**  

| **AI Capabilities**                     | **Impact**                                  |
|-----------------------------------------|-------------------------------------------|
| **Data processing & analysis**         | Generates insights from complex data     |
| **Machine learning & adaptation**      | Continuously improves decision-making    |
| **Real-time response & precision**     | Enhances speed & accuracy in applications |
| **Smart automation**                   | Reduces human effort & increases efficiency |

AI is **revolutionizing industries** by making machines **smarter, faster, and more efficient**.


# **Applications of AI**  

AI is transforming various industries by enabling machines to **see, hear, understand, and make decisions**.  

## **Key AI Applications Across Industries**  

| **Application**                 | **Description**                                         | **Examples** |
|---------------------------------|---------------------------------------------------------|-------------|
| **Computer Vision**             | Analyzes visual data (images/videos)                   | Reverse image search, facial recognition, autonomous vehicles |
| **Natural Language Processing (NLP)** | Understands and processes human language       | Search engines, chatbots, sentiment analysis |
| **Speech Recognition**          | Converts spoken language into text                     | Virtual assistants (Siri, Alexa), voice-controlled devices |
| **Expert Systems**              | Uses AI to provide expert-level advice and decisions   | Medical diagnosis, financial forecasting |
| **AI in Games**                 | Develops intelligent agents for gaming                 | AlphaGo, adaptive AI in video games |
| **Robotics**                    | Builds AI-powered robots with sensors and actuators    | Industrial automation, autonomous drones |

AI continues to **expand rapidly**, shaping the future of technology across multiple domains.  


# **Branches of AI**  

AI consists of multiple specialized fields that address different types of problems.  

## **Key Branches of AI**  

| **Branch**                     | **Description**                                         | **Applications** |
|---------------------------------|---------------------------------------------------------|------------------|
| **Machine Learning (ML)**       | AI learns patterns from data and makes predictions     | Face recognition, recommendation systems |
| **Logic-Based AI**              | Uses mathematical logic to solve problems              | Language parsing, semantic analysis |
| **Search Algorithms**           | Examines possibilities to find the best solution       | Chess, networking, resource allocation |
| **Knowledge Representation**    | Organizes facts and relationships in a structured way  | Expert systems, ontology-based AI |
| **Planning**                    | Develops optimal strategies to achieve goals          | Autonomous robotics, logistics optimization |
| **Heuristics**                  | Uses practical problem-solving methods for quick results | Search engines, robotics navigation |
| **Genetic Programming**         | Evolves AI models by mimicking natural selection      | AI optimization, automated software generation |

Each **AI branch** focuses on a different problem-solving approach, making AI highly **versatile and powerful**.


# **Defining Intelligence Using the Turing Test**  

Alan Turing proposed the **Turing Test** to define intelligence by evaluating whether a machine can mimic human behavior in a conversation.  

## **Turing Test Setup**  

| **Component**               | **Description** |
|----------------------------|----------------|
| **Interrogator (Human)**   | Asks questions via a text interface |
| **Respondents**            | One human and one machine |
| **Objective**              | The machine must mimic human responses well enough to fool the interrogator |

### **Core AI Areas Required to Pass the Test**  

| **AI Domain**                  | **Role in the Turing Test** |
|--------------------------------|----------------------------|
| **Natural Language Processing (NLP)** | Helps the machine understand and generate human-like responses |
| **Knowledge Representation**   | Stores and retrieves information relevant to the conversation |
| **Reasoning**                  | Interprets and processes information logically |
| **Machine Learning**           | Adapts and improves responses based on experience |

### **Extended Turing Test Variants**  

| **Test Type**         | **Additional Requirements** |
|----------------------|---------------------------|
| **Standard Turing Test** | Machine communicates via text only |
| **Total Turing Test** | Machine must also see objects (**Computer Vision**) and move (**Robotics**) |

Turing believed that **physical appearance** was **not necessary** for intelligence, which is why the original test only used **text-based interaction**.  

Understanding the **Turing Test** provides insight into AI’s ability to **replicate human-like intelligence**!  


# **Making Machines Think Like Humans**  

To make machines think like humans, we must first understand **human thought processes**.  

## **Approaches to Understanding Human Thinking**  

| **Approach**                    | **Description** |
|----------------------------------|----------------|
| **Observing Human Responses**    | Recording human behavior and reactions, but this becomes complex with too much data |
| **Experimentation**              | Creating structured questions and analyzing human responses to build a model |

Once enough **data is gathered**, we develop **AI models** that mimic human cognition. If the **output** aligns with **human behavior**, we can say that machines are thinking similarly to humans.  

## **Cognitive Modeling in AI**  

| **Concept**                    | **Role in AI** |
|---------------------------------|----------------|
| **Cognitive Modeling**         | Simulates human thought processes in AI systems |
| **Problem-Solving Simulation** | Models how humans approach and solve problems |
| **AI Applications**            | Used in **Deep Learning, Expert Systems, NLP, and Robotics** |

Cognitive modeling helps build AI systems that can **reason, learn, and adapt**, making them **more human-like** in thinking and decision-making!  


# **Building Rational Agents**  

AI research focuses on creating **rational agents**: agents that choose the action expected to bring the greatest benefit in a given situation.  

## **Key Concepts of Rational Agents**  

| **Concept**        | **Definition** |
|--------------------|---------------|
| **Rationality**    | Doing the **right thing** based on available information |
| **Agent**         | An entity that perceives and acts in an environment |
| **Rational Agent** | Takes actions to achieve goals efficiently and intelligently |

## **How Rational Agents Work**  

| **Step**              | **Description** |
|----------------------|----------------|
| **Perception**       | Agent **receives input** from its environment |
| **Processing**       | It **analyzes** information based on predefined rules |
| **Action**          | Takes the **best possible action** to achieve its goal |

## **Performance Measure of Rational Agents**  

| **Factor**              | **Impact on Rationality** |
|------------------------|-------------------------|
| **Success Rate**       | Measures how well the agent **achieves its goal** |
| **Inference Ability**  | Ability to **draw correct conclusions** from information |
| **Handling Uncertainty** | Acts even in situations where no perfect solution exists |

A **rational agent** should **adapt to new environments**, **make logical decisions**, and **optimize performance** to achieve its objectives efficiently.
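
The perceive, process, act cycle can be illustrated with a very small agent. The sketch below is a hypothetical thermostat that scores each available action with a performance measure and picks the best one; the action names, their effects, and the target temperature are all assumptions made for the example.

```python
def choose_action(temperature, target=21.0):
    """Pick the action whose predicted outcome is closest to the target."""
    # Hypothetical effect of each action on the temperature (degrees Celsius).
    effects = {"heat": 1.0, "cool": -1.0, "idle": 0.0}

    def performance(action):
        # Higher is better: negative distance from the target after acting.
        return -abs((temperature + effects[action]) - target)

    return max(effects, key=performance)

# Perception (a temperature reading) -> processing (scoring) -> action.
for reading in [18.0, 21.0, 24.5]:
    print(reading, "->", choose_action(reading))
```

The agent is rational only with respect to its performance measure: change the measure (for example, to penalize energy use) and the best action may change too.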

# **General Problem Solver (GPS)**  

The **General Problem Solver (GPS)** was an AI program developed by **Herbert Simon, J.C. Shaw, and Allen Newell**. It was one of the first **universal problem-solving** systems in AI. Unlike earlier programs designed for specific tasks, GPS aimed to solve **any general problem** using a **single algorithm**.  

## **Key Features of GPS**  

| **Feature**                 | **Description** |
|-----------------------------|---------------|
| **Universal Problem Solving** | Designed to solve **various types of problems** using the same algorithm |
| **Information Processing Language (IPL)** | Custom language created to define problem structures |
| **Graph-Based Approach** | Problems represented as **graphs** with **sources (axioms)** and **sinks (conclusions)** |
| **Brute Force Search** | Used search algorithms to explore **possible solutions**, but faced scalability issues |

## **Limitations of GPS**  

| **Limitation**              | **Impact** |
|----------------------------|------------|
| **Only Solves Well-Defined Problems** | Could handle **mathematical proofs, logic puzzles, and chess**, but struggled with real-world tasks |
| **Computational Complexity** | Required **excessive resources** to brute force large problem spaces |
| **Predefined Problem Structure** | Required problems to be framed in a **specific logical format** |

## **Example: Solving a Problem with GPS**  

**Goal:** Get milk from a grocery store.  

| **Step** | **Description** |
|----------|---------------|
| **Define Goals** | The objective is to **buy milk** from the store. |
| **Set Preconditions** | - Need a **mode of transportation**.<br>- Grocery store must **have milk available**. |
| **Define Operators** | - If using a car, check **fuel availability**.<br>- Ensure **payment capability** for both fuel and milk. |

GPS used a **search-based approach**, evaluating **all possible actions** to determine the best way to achieve a goal. However, due to **computational complexity**, it struggled with **large-scale real-world applications**.
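
The milk example can be expressed as goals, preconditions, and operators and then solved by search. The sketch below uses a plain breadth-first search over sets of facts; it is an illustration of the idea, not a reconstruction of the original GPS program, and every operator and fact name is hypothetical.

```python
from collections import deque

# Each operator: (name, preconditions, facts it adds). All names are hypothetical.
OPERATORS = [
    ("refuel car", {"have money"}, {"car has fuel"}),
    ("drive to store", {"car has fuel"}, {"at store"}),
    ("buy milk", {"at store", "have money", "store has milk"}, {"have milk"}),
]

def plan(initial_facts, goal):
    """Breadth-first search for a shortest operator sequence that reaches the goal."""
    start = frozenset(initial_facts)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        facts, steps = queue.popleft()
        if goal in facts:
            return steps
        for name, preconditions, added in OPERATORS:
            # An operator applies only when all its preconditions hold.
            if preconditions <= facts:
                new_facts = facts | added
                if new_facts not in seen:
                    seen.add(new_facts)
                    queue.append((new_facts, steps + [name]))
    return None  # no sequence of operators reaches the goal

print(plan({"have money", "store has milk"}, "have milk"))
```

Even in this toy version the number of reachable states grows quickly as operators are added, which hints at why exhaustive search does not scale to large real-world problems.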

# **Building an Intelligent Agent**  

To build an **intelligent agent**, we use various techniques like **machine learning, stored knowledge, and rule-based systems**. Among these, **machine learning** is the most widely used method, allowing agents to learn from **data and training**.  

## **How an Intelligent Agent Interacts with the Environment**  

 **Sensor Perception** → Collects data from the environment.  
 **Feature Extraction** → Identifies relevant patterns from the input data.  
 **Inference Engine** → Uses a **trained machine learning model** to make predictions.  
 **Decision Making** → Based on predictions, it determines the best action.  
 **Actuator Execution** → Performs the necessary real-world action.  

## **Applications of Intelligent Agents**  

| **Application**         | **Use Case** |
|-------------------------|-------------|
| **Image Recognition**   | Facial recognition, object detection |
| **Robotics**            | Autonomous vehicles, industrial automation |
| **Speech Recognition**  | Voice assistants, real-time transcription |
| **Stock Market Prediction** | Financial forecasting and trend analysis |

## **Key Concepts in Machine Learning for Intelligent Agents**  

| **Concept**               | **Description** |
|---------------------------|---------------|
| **Pattern Recognition**   | Identifying trends and structures in data |
| **Artificial Neural Networks** | Mimicking the human brain for complex decision-making |
| **Data Mining**           | Extracting useful insights from large datasets |
| **Statistics**            | Understanding probabilities and data distributions |

By integrating **machine learning**, intelligent agents can **continuously improve** and make **better decisions** over time.


# **Types of Models in AI**  

AI models can be categorized into two main types: **Analytical Models** and **Learned Models**.  

## **Comparison Table**  

| **Model Type**        | **Definition**  | **Characteristics**  | **Advantages**  | **Limitations**  | **Example Use Cases**  |
|----------------------|----------------|---------------------|-----------------|------------------|------------------------|
| **Analytical Models**  | Traditional models based on mathematical formulas to describe relationships. | - Based on human judgment  <br> - Uses predefined equations  <br> - Simplistic with few parameters  | - Interpretable & explainable  <br> - Works well for simple systems  | - Limited accuracy  <br> - Cannot handle large datasets efficiently  | Physics-based simulations, financial forecasting with linear models  |
| **Learned Models**  | Models trained using data rather than predefined formulas. | - Learns patterns from data  <br> - Uses machine learning  <br> - Highly complex with many parameters  | - Handles large datasets  <br> - High accuracy  <br> - Can adapt to new data  | - Requires large amounts of data  <br> - Computationally expensive  | Image recognition, speech processing, autonomous driving |

## **Machine Learning and Learned Models**  
Machine learning allows AI to automatically **discover patterns and relationships** from data, replacing the need for manually derived formulas.  

In supervised learning, the training data consists of paired examples:

- **Inputs** – Features provided to the model  
- **Corresponding Outputs (labels)** – Desired results  

Once trained, AI models can **predict outcomes** based on unseen inputs.  
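
The contrast between the two model types can be sketched in a few lines. Below, an analytical model is a formula written by hand, while a learned model recovers the same relationship from example data with a least-squares fit. The distance = speed × time scenario and the numbers are made up for illustration.

```python
import numpy as np

# Analytical model: a formula written down by a person.
def analytical_distance(speed, time):
    return speed * time

# Example data (here produced by the formula itself, as a stand-in for observations).
time_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
distance_obs = analytical_distance(speed=60.0, time=time_obs)

# Learned model: np.polyfit estimates distance ≈ slope * time + intercept from the data.
slope, intercept = np.polyfit(time_obs, distance_obs, deg=1)

# Predict for an input that was not in the training data.
predicted = slope * 6.5 + intercept
print(f"learned slope={slope:.2f}, prediction at t=6.5: {predicted:.1f}")
```

The learned model never sees the formula, only input and output pairs, yet it recovers a slope close to the true speed.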


In [None]:
# Import the dataset loader
from sklearn.datasets import fetch_california_housing

# Load the dataset (downloaded and cached on first use)
data = fetch_california_housing()
data

### DESCR attribute

In [None]:
# Use print so the description's line breaks render properly
print(data.DESCR)

In [None]:
# Print the first five rows of data
print("First 5 rows of the dataset:\n", data.data[:5])

# **California Housing Dataset**  

The **California Housing Dataset** is a regression dataset derived from the **1990 U.S. Census**, containing **20,640 instances** with **8 numerical features** and **no missing values**.  

## **Key Features:**  
- Each row represents a **block group**, the smallest geographical unit for which the U.S. Census Bureau publishes sample data, typically with a population of **600–3,000 people**.  
- Attributes include:  
  - **MedInc** → Median income in block group  
  - **HouseAge** → Median house age  
  - **AveRooms** → Average rooms per household  
  - **AveBedrms** → Average bedrooms per household  
  - **Population** → Block group population  
  - **AveOccup** → Average number of household members  
  - **Latitude** → Block group latitude  
  - **Longitude** → Block group longitude  
- **Target Variable:** Median house value (in **$100,000 units**).  



In [None]:
# Convert to DataFrame for easy handling
import pandas as pd
df = pd.DataFrame(data.data, columns=data.feature_names)
df

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.duplicated().sum()

In [None]:
df.isnull().sum()

In [None]:
df.describe()

In [None]:
# Add the target variable (house value)
df['Target'] = data.target
df['Target']

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# Plot the distribution of the target variable
plt.figure(figsize=(8, 5))
sns.histplot(df['Target'], bins=30, kde=True, color='blue')
plt.title("Distribution of Median House Values")
plt.xlabel("Median House Value (in $100,000s)")
plt.ylabel("Frequency")
plt.show()

In [None]:
# Plot the distribution of the Population feature
plt.figure(figsize=(8, 5))
sns.histplot(df['Population'], bins=30, kde=True, color='blue')
plt.title("Distribution of Block Group Population")
plt.xlabel("Population")
plt.ylabel("Frequency")
plt.show()

In [None]:
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Plot correlation heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()

In [None]:
df.corr()

In [None]:
# Define features (X) and target (y)
X = df.drop(columns=['Target'])
X

In [None]:
y = df['Target']
y

In [None]:
from sklearn.model_selection import train_test_split

# Split dataset into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train

In [None]:
X_test

In [None]:
y_train

In [None]:
y_test

In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
from sklearn.preprocessing import StandardScaler

# Standardize the features
scaler = StandardScaler()
scaler

In [None]:
X_train_scaled = scaler.fit_transform(X_train)
X_train_scaled

In [None]:
X_test_scaled = scaler.transform(X_test)
X_test_scaled

In [None]:
from sklearn.linear_model import LinearRegression
# Train a linear regression model
model = LinearRegression()
model

In [None]:
model.fit(X_train_scaled, y_train)

In [None]:
# Predict on the test set
y_pred = model.predict(X_test_scaled)
y_pred

In [None]:
from sklearn.metrics import mean_absolute_error

# Calculate error metrics
mae = mean_absolute_error(y_test, y_pred)
mae

In [None]:
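import numpy as np
from sklearn.metrics import mean_squared_error
# RMSE is the square root of MSE, so compute MSE first
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
rmse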

In [None]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
r2

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Scatter plot of actual vs. predicted values
plt.figure(figsize=(8, 6))
sns.scatterplot(x=y_test, y=y_pred, alpha=0.5, color='red')
plt.plot([0, 7], [0, 7], '--', color='blue')  # Ideal prediction line
plt.xlabel("Actual Median House Value")
plt.ylabel("Predicted Median House Value")
plt.title("Actual vs. Predicted House Values")
plt.show()

## Model Performance Evaluation

The model's performance is assessed using several key metrics:

- **Mean Absolute Error (MAE):** **0.5332**  
  - On average, the model's predictions deviate from the actual median house values by about **$53,320**.

- **Mean Squared Error (MSE):** **0.5559**  
  - The average of the squared differences between predicted and actual values; squaring emphasizes larger errors.

- **Root Mean Squared Error (RMSE):** **0.7456**  
  - The square root of MSE, about **$74,560**. Because it is in the same units as the target, it is easier to interpret than MSE.

- **R-squared Score (R²):** **0.5758**  
  - Indicates that the model explains about **57.58% of the variance** in house prices.  
  - While this suggests moderate accuracy, further improvements can be made.
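
To see how these four metrics relate to one another, the sketch below computes them from their definitions with NumPy on a tiny made-up example. The arrays are illustrative and unrelated to the housing results above; the `_demo` suffix keeps the names separate from the notebook's own variables.

```python
import numpy as np

# Made-up true values and predictions for illustration.
y_true = np.array([3.0, 1.0, 2.0, 4.0])
y_hat = np.array([2.5, 1.5, 2.0, 5.0])

errors = y_true - y_hat
mae_demo = np.mean(np.abs(errors))    # average absolute error
mse_demo = np.mean(errors ** 2)       # average squared error
rmse_demo = np.sqrt(mse_demo)         # back in the units of the target
# R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
r2_demo = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE={mae_demo:.3f}  MSE={mse_demo:.3f}  RMSE={rmse_demo:.3f}  R2={r2_demo:.3f}")
```

Here the single error of 1.0 contributes most of the MSE, which shows how squaring weights large errors more heavily than MAE does.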

### Possible Enhancements:
- Refining feature selection for better input variables.
- Experimenting with polynomial regression for capturing complex relationships.
- Trying advanced models like Decision Trees, Random Forests, or Neural Networks for improved predictive power.


In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Create a pipeline with feature scaling and a RandomForest model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Hyperparameter tuning with GridSearchCV
param_grid = {
    'model__n_estimators': [50, 100, 200],
    'model__max_depth': [None, 10, 20]
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='r2', n_jobs=-1)
grid_search.fit(X_train, y_train)

In [None]:
# Evaluate the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
y_pred

In [None]:
# Compute evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mae

In [None]:
mse = mean_squared_error(y_test, y_pred)
mse

In [None]:
rmse = np.sqrt(mse)
rmse

In [None]:
r2 = r2_score(y_test, y_pred)
r2

In [None]:
#  Extract feature importance from the trained Random Forest model
feature_importance = best_model.named_steps['model'].feature_importances_
feature_importance

In [None]:
#  Import necessary library for interactive plotting
import plotly.express as px

# Retrieve feature names from the dataset (ensure X is a DataFrame)
feature_names = X.columns
feature_names

In [None]:
import pandas as pd
# Create a DataFrame to store feature names and their importance scores
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importance})
importance_df

In [None]:
# Sort ascending so the most important feature appears at the top of the horizontal bar chart
importance_df = importance_df.sort_values(by='Importance', ascending=True)
importance_df

In [None]:
import plotly.express as px

# Create an interactive horizontal bar chart using Plotly Express
fig = px.bar(importance_df,
             x='Importance',
             y='Feature',
             orientation='h',  # Horizontal bar chart
             title="Feature Importance in Random Forest Model",
             labels={'Importance': 'Importance Score', 'Feature': 'Feature'},
             color='Importance',  # Color based on importance values
             color_continuous_scale='viridis')  # Use Viridis color scale
# Show the interactive plot
fig.show()

In [None]:
# Compute residuals (errors)
residuals = y_test - y_pred
residuals

In [None]:
#  Plot histogram of residuals
fig_hist = px.histogram(residuals,
                        nbins=50,
                        title="Residual Distribution",
                        labels={'value': 'Residuals'},
                        marginal="box",  # Add boxplot for distribution insight
                        color_discrete_sequence=['royalblue'])

fig_hist.show()

In [None]:
# Scatter plot of residuals vs. predicted values
fig_scatter = px.scatter(x=y_pred,
                          y=residuals,
                          title="Residuals vs. Predicted Values",
                          labels={'x': 'Predicted Values', 'y': 'Residuals'},
                          color=residuals,
                          color_continuous_scale='viridis')

# Add a horizontal reference line at y=0 (no error)
fig_scatter.add_hline(y=0, line_dash="dash", line_color="red")

fig_scatter.show()

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

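# Compute Variance Inflation Factor (VIF) for each feature.
# Note: X has no constant column, so each auxiliary regression is fit
# without an intercept, which tends to inflate the reported values.
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_data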

In [None]:
import plotly.express as px

# Sort by VIF and display the results as a horizontal bar chart
vif_data = vif_data.sort_values(by="VIF", ascending=True)
fig_vif = px.bar(
    vif_data,
    x="VIF",
    y="Feature",
    title="Variance Inflation Factor (VIF) for Features",
    orientation="h",
    color="VIF",
    color_continuous_scale="viridis",
)
fig_vif.show()


## **Interpreting Variance Inflation Factor (VIF) Results**

**If VIF is low (<5)** → Little multicollinearity; the feature is not well explained by the other features.  

**If VIF is high (>5, or >10 under a looser rule of thumb)** → The feature is largely redundant with others; consider removing or transforming it.  

**If correlations are strong** → Consider feature selection (e.g. **Lasso Regression**) or dimensionality reduction (e.g. **Principal Component Analysis (PCA)**).  
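
For intuition about what a VIF value means, the sketch below computes it from its definition, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing feature j on the remaining features. The data are synthetic, the `vif` helper is an illustrative name, and an intercept is included in each regression as an explicit design choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + 0.1 * rng.normal(size=n)   # nearly a copy of `a`, so strongly collinear
features = np.column_stack([a, b, c])

def vif(features, j):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    target = features[:, j]
    others = np.delete(features, j, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    r_squared = 1.0 - resid.var() / target.var()
    return 1.0 / (1.0 - r_squared)

vifs = [vif(features, j) for j in range(features.shape[1])]
print([round(v, 1) for v in vifs])
```

The two collinear columns receive large VIFs while the independent column stays near 1, matching the rule of thumb above.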

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

# Function to calculate VIF
def calculate_vif(X):
    vif_data = pd.DataFrame()
    vif_data["Feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif_data

# Compute VIF
vif_df = calculate_vif(X)

# Drop features with high VIF (>5)
high_vif_features = vif_df[vif_df["VIF"] > 5]["Feature"].tolist()
X_selected = X.drop(columns=high_vif_features)

print("Dropped Features:", high_vif_features)

In [None]:
from sklearn.decomposition import PCA

# Reduce dimensionality while keeping 95% variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_selected)

print("New Shape after PCA:", X_pca.shape)

In [None]:
from sklearn.linear_model import Lasso

# Apply Lasso Regression to select important features
lasso = Lasso(alpha=0.01)  # Adjust alpha based on need
lasso.fit(X_selected, y)

# Select features with non-zero coefficients
selected_features = X_selected.columns[lasso.coef_ != 0]
X_lasso = X_selected[selected_features]

print("Selected Features after Lasso:", selected_features.tolist())
