<a href="https://colab.research.google.com/github/raamponsah/visualization-workshops/blob/main/Visualization_with_Seaborn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Seaborn Visualization Cheat Sheet**

## **1. Distribution Plots (Single Variable)**
> **Use when:** Understanding data distribution (e.g., income levels, test scores)

### **Best Plots:**
- **Histogram:** `sns.histplot()` → Numeric data distribution
- **KDE Plot:** `sns.kdeplot()` → Smooth density curve
- **Box Plot:** `sns.boxplot()` → Quartiles & outliers

### **Example:**
```python
sns.histplot(data=tips, x="tip", bins=20, kde=True)
plt.show()
```

Choosing the right visualization comes down to recognizing the patterns in your data and the kind of insight you want. The sections in this cheat sheet form a simple decision framework for that choice.

---

## 2. Comparison Plots (Categories vs. Numeric)
**Use when:** Comparing different groups (e.g., male vs. female, product sales per region)

### Best Plots:
- **Bar Plot:** `sns.barplot()` - Shows averages per category
- **Count Plot:** `sns.countplot()` - Shows frequency per category
- **Box Plot:** `sns.boxplot()` - Shows data spread

### Example:
```python
sns.barplot(x="day", y="tip", data=tips)
plt.show()
```

## 3. Relationship Plots (Numeric vs. Numeric)
**Use when:** Analyzing relationships between two numeric variables (e.g., age vs. salary)

### Best Plots:
- **Scatter Plot:** `sns.scatterplot()` - Best for relationships
- **Line Plot:** `sns.lineplot()` - Best for trends over time
- **Regression Plot:** `sns.lmplot()` - Trends + predictions

### Example:
```python
sns.scatterplot(x="total_bill", y="tip", hue="sex", data=tips)
plt.show()
```

## 4. Trend & Time Series Plots
**Use when:** Identifying patterns over time (e.g., stock prices, daily sales)

### Best Plots:
- **Line Plot:** `sns.lineplot()` - Best for time trends
- **Area Plot:** `sns.lineplot()` plus Matplotlib's `ax.fill_between()` - Shades the area under the line to emphasize magnitude

### Example:
```python
# sales_data is a placeholder DataFrame with "date", "sales", and "store" columns
sns.lineplot(x="date", y="sales", hue="store", data=sales_data)
plt.show()
```

## 5. Correlation & Relationship Strength
**Use when:** Checking how variables influence each other (e.g., study hours vs. exam scores)

### Best Plots:
- **Heatmap:** `sns.heatmap()` - Best for correlation analysis
- **Pair Plot:** `sns.pairplot()` - Quick multi-variable relationships

### Example:
```python
# numeric_only=True skips categorical columns such as "sex" and "day"
sns.heatmap(tips.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
```

## 6. Multi-Category Analysis
**Use when:** Comparing multiple categories at once (e.g., sales per region per year)

### Best Plots:
- **FacetGrid:** `sns.FacetGrid()` - Splits into multiple charts
- **Violin Plot:** `sns.violinplot()` - Shows distribution + quartiles

### Example:
```python
g = sns.FacetGrid(tips, col="day", row="time")
g.map(sns.histplot, "total_bill")
plt.show()
```
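
The violin plot from the list can compare two categories at once by adding `hue`. The sketch below runs on a synthetic `demo` frame whose column names mirror `tips` (the data itself is an assumption, not the real dataset).

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with tips-like column names (assumption)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], size=200),
    "sex": rng.choice(["Male", "Female"], size=200),
    "total_bill": rng.gamma(shape=4.0, scale=5.0, size=200),
})

# One violin per day, split into two hues for the second category
ax = sns.violinplot(data=demo, x="day", y="total_bill", hue="sex",
                    order=["Thur", "Fri", "Sat", "Sun"])
ax.set_title("Total bill by day and sex (synthetic data)")
plt.show()
```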

## Final Cheat Sheet
| Insight Needed                | Best Plot                 | Function              |
|--------------------------------|---------------------------|-----------------------|
| Distribution of one variable   | Histogram / KDE Plot     | `sns.histplot()` / `sns.kdeplot()` |
| Comparison across categories   | Bar Plot / Box Plot      | `sns.barplot()` / `sns.boxplot()` |
| Relationship between two numbers | Scatter / Regression Plot | `sns.scatterplot()` / `sns.lmplot()` |
| Trends over time              | Line Plot                | `sns.lineplot()`      |
| Correlation between features  | Heatmap                  | `sns.heatmap()`       |
| Multi-variable relationships  | Pair Plot                | `sns.pairplot()`      |
| Compare categories + distribution | Violin Plot         | `sns.violinplot()`    |
| Multiple charts by category   | FacetGrid                | `sns.FacetGrid()`     |
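
The table can also be expressed as a small lookup. The helper below is purely illustrative: `PLOT_GUIDE`, its keys, and `suggest_plots` are hypothetical names, not part of Seaborn.

```python
# Hypothetical lookup that mirrors the cheat sheet table above
PLOT_GUIDE = {
    "distribution": ("histplot", "kdeplot"),
    "comparison": ("barplot", "boxplot"),
    "relationship": ("scatterplot", "lmplot"),
    "trend": ("lineplot",),
    "correlation": ("heatmap",),
    "multi-variable": ("pairplot",),
    "category-distribution": ("violinplot",),
    "facets": ("FacetGrid",),
}


def suggest_plots(insight: str) -> tuple:
    """Return the Seaborn function names suggested for an insight keyword."""
    key = insight.strip().lower()
    if key not in PLOT_GUIDE:
        raise KeyError(f"Unknown insight {insight!r}; choose from {sorted(PLOT_GUIDE)}")
    return PLOT_GUIDE[key]


print(suggest_plots("Trend"))
```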

---

# **Mini Project: Analyzing Restaurant Tips with Seaborn**

## **Objective:**
Analyze the `tips` dataset to uncover spending and tipping patterns.

---

## **Step 1: Load and Explore the Dataset**

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
tips = sns.load_dataset("tips")

# Quick overview
print(tips.head())
print(tips.info())
```

## **Step 2: What is the distribution of total bill amounts?**

```python
sns.histplot(tips['total_bill'], bins=20, kde=True)
plt.title("Distribution of Total Bill Amounts")
plt.show()
```

✅ **Insight:** Most bills fall between **$10 and $40**, with a peak around **$15-$20** and a long tail of larger bills.

---

## **Step 3: Do men or women tip more on average?**

```python
sns.barplot(x="sex", y="tip", data=tips)
plt.title("Average Tip Amount by Gender")
plt.show()
```

✅ **Insight:** Women and men tip nearly the same on average, with men tipping slightly more.
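
A bar plot shows the group means, so a quick numeric summary is a useful cross-check. The helper `summarize_tips` and the tiny `sample` frame below are hypothetical; in the project you would pass the `tips` DataFrame from Step 1 instead.

```python
import pandas as pd


def summarize_tips(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Mean, median, and row count of `tip` for each level of `by`."""
    return df.groupby(by, observed=True)["tip"].agg(["mean", "median", "count"])


# Illustrative rows only, not the real dataset
sample = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male"],
    "tip": [3.0, 2.5, 4.0, 3.5, 2.0],
})
summary = summarize_tips(sample, "sex")
print(summary)
```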

---

## **Step 4: Which day sees the highest total bill amount?**

```python
sns.boxplot(x="day", y="total_bill", data=tips)
plt.title("Total Bill Distribution by Day")
plt.show()
```

✅ **Insight:** **Saturday and Sunday have the highest bills**, suggesting that customers spend more on weekends.

---

## **Step 5: How does the tip amount vary with the total bill?**

```python
sns.lmplot(x="total_bill", y="tip", hue="sex", data=tips)
plt.title("Total Bill vs. Tip Amount")
plt.show()
```

✅ **Insight:** Tip increases as **total bill increases**, but not proportionally.
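
One way to check the "not proportionally" part is to express each tip as a percentage of its bill. The sketch below uses a hypothetical helper `add_tip_pct` and made-up rows; applied to `tips`, it would add the same `tip_pct` column.

```python
import pandas as pd


def add_tip_pct(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with the tip as a percentage of the total bill."""
    out = df.copy()
    out["tip_pct"] = 100 * out["tip"] / out["total_bill"]
    return out


# Illustrative rows: the tip grows with the bill, but the percentage shrinks
sample = pd.DataFrame({"total_bill": [10.0, 20.0, 40.0], "tip": [2.0, 3.0, 5.0]})
result = add_tip_pct(sample)
print(result)
```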

---

## **Step 6: Do smokers tip differently than non-smokers?**

```python
sns.violinplot(x="smoker", y="tip", data=tips)
plt.title("Tip Amount Distribution (Smokers vs Non-Smokers)")
plt.show()
```

✅ **Insight:** Smokers have **more variation** in tips but tip slightly **lower on average**.

---

## **Step 7: Which meal time (Lunch/Dinner) has higher spending?**

```python
sns.boxplot(x="time", y="total_bill", data=tips)
plt.title("Total Bill Amount by Meal Time")
plt.show()
```

✅ **Insight:** Dinner sees **higher bills than Lunch**, indicating **more expensive meals or larger groups**.

---

## **Final Observations:**
📌 **Weekend spending is higher**  
📌 **Dinner is more expensive than lunch**  
📌 **Men tip slightly more than women**  
📌 **Smokers tip less on average**  

---

### **Next Steps:**
- Analyze the impact of group size on tipping.
- Examine how tipping percentage varies by day and time.
- Compare tipping behaviors across different regions (this needs additional data, since `tips` has no region column).
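
The first two items could start from a sketch like the one below. It assumes the `tips` column names (`day`, `time`, `size`, `total_bill`, `tip`) and runs on a few made-up rows, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Made-up rows that reuse the tips column names (assumption)
sample = pd.DataFrame({
    "day": ["Thur", "Thur", "Sat", "Sat"],
    "time": ["Lunch", "Lunch", "Dinner", "Dinner"],
    "size": [2, 4, 2, 4],
    "total_bill": [20.0, 40.0, 25.0, 50.0],
    "tip": [4.0, 6.0, 5.0, 7.5],
})
sample["tip_pct"] = 100 * sample["tip"] / sample["total_bill"]

# Average tipping percentage for each day and meal time
by_day_time = sample.pivot_table(index="day", columns="time",
                                 values="tip_pct", aggfunc="mean")
# Average tipping percentage for each group size
by_size = sample.groupby("size")["tip_pct"].mean()

print(by_day_time)
print(by_size)
```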



# **Mini Project: Bank Customer Analysis & Prediction**

## **Objective:**
Analyze bank customers and predict if they will subscribe to a term deposit.

---


## **Step 1: Load and Explore the Dataset**

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
url = "https://raw.githubusercontent.com/selva86/datasets/master/BankMarketing.csv"
df = pd.read_csv(url)

# Quick overview
print(df.head())
print(df.info())
print(df.describe())
```

## **Step 2: Data Cleaning & Preprocessing**

```python
# Lowercase the column names for consistency
df.columns = df.columns.str.lower()

# Convert target variable to binary (yes=1, no=0)
df['y'] = df['y'].map({'yes': 1, 'no': 0})

# Check for missing values
print(df.isnull().sum())
```

✅ **No missing values!** 🎉

---

## **Step 3: Exploratory Data Analysis (EDA)**

### **1️⃣ Which job type has the highest subscriptions?**

```python
plt.figure(figsize=(12,5))
sns.countplot(x='job', hue='y', data=df, order=df['job'].value_counts().index)
plt.xticks(rotation=45)
plt.title("Subscription Counts by Job Type")
plt.show()
```

✅ **Insight:** **Management & blue-collar jobs** have more customers, but professionals subscribe more.
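
A count plot shows how many customers fall in each group, not the share who subscribe. A per-job rate can be computed directly, as in this sketch. The `sample` frame is made up; with the real data you would use `df` after the Step 2 mapping of `y` to 0/1.

```python
import pandas as pd

# Made-up rows; y is already encoded as 1 (subscribed) / 0 (not subscribed)
sample = pd.DataFrame({
    "job": ["management", "management", "blue-collar",
            "blue-collar", "blue-collar", "student"],
    "y": [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the subscription rate for that job
rates = (
    sample.groupby("job")["y"]
    .agg(customers="count", subscription_rate="mean")
    .sort_values("subscription_rate", ascending=False)
)
print(rates)
```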

---


### **2️⃣ How does age relate to subscription?**

```python
sns.histplot(df[df['y'] == 1]['age'], bins=20, kde=True, color='green', label="Subscribed")
sns.histplot(df[df['y'] == 0]['age'], bins=20, kde=True, color='red', label="Not Subscribed")
plt.legend()
plt.title("Age Distribution of Subscribed vs. Not Subscribed Customers")
plt.show()
```

✅ **Insight:** Older customers **subscribe more frequently** than younger ones.
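
Because far fewer customers subscribe than not, overlaid count histograms can be hard to compare. A subscription rate per age band is one alternative view. The bin edges, labels, and rows below are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up rows; replace with df[["age", "y"]] from the real dataset
sample = pd.DataFrame({
    "age": [22, 28, 35, 41, 47, 55, 63, 70],
    "y":   [1, 0, 0, 0, 0, 0, 1, 1],
})

# Hypothetical age bands; each interval includes its right edge
bins = [17, 30, 45, 60, 100]
labels = ["18-30", "31-45", "46-60", "61+"]
sample["age_group"] = pd.cut(sample["age"], bins=bins, labels=labels)

rate_by_age = sample.groupby("age_group", observed=True)["y"].mean()
print(rate_by_age)
```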

---

### **3️⃣ Does previous campaign success affect new subscriptions?**

```python
# y is binary, so plot its mean (the subscription rate) per outcome instead of a box plot
sns.barplot(x="poutcome", y="y", data=df)
plt.title("Previous Campaign Outcome vs Subscription")
plt.show()
```

✅ **Insight:** Customers who had a previous **successful** campaign are more likely to subscribe again!

---

## **Step 4: Feature Selection & Model Preparation**

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Encode categorical variables as integer codes (the codes are arbitrary and carry no real order)
categorical_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
df_encoded = df.copy()

for col in categorical_cols:
    df_encoded[col] = LabelEncoder().fit_transform(df_encoded[col])

# Define features and target
X = df_encoded.drop(columns=['y'])
y = df_encoded['y']

# Split into train & test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)
```

✅ **Data is ready for prediction!**
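
Integer label codes imply an ordering that a linear model may treat as meaningful. One-hot encoding is a common alternative, sketched here with `pd.get_dummies` on a made-up frame. This swaps in a different encoding from the one used above, so results would not match exactly.

```python
import pandas as pd

# Made-up rows with one categorical feature
sample = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["single", "married", "divorced"],
    "y": [0, 1, 0],
})

# One indicator column per category; drop_first avoids a redundant column
X_onehot = pd.get_dummies(sample.drop(columns=["y"]),
                          columns=["marital"], drop_first=True)
print(X_onehot.columns.tolist())
```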



## **Step 5: Train a Prediction Model (Logistic Regression)**

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

✅ **Accuracy ~85%** on unseen data! 🎯
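
Accuracy is easier to judge next to a baseline that always predicts the most common class. The sketch below uses a hypothetical helper `majority_baseline_accuracy` and an illustrative 88/12 class split (an assumption, not a measurement); with the real data you would pass `y_test`.

```python
import pandas as pd


def majority_baseline_accuracy(y_true: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return float(y_true.value_counts(normalize=True).max())


# Illustrative imbalanced target: 88 zeros and 12 ones
y_demo = pd.Series([0] * 88 + [1] * 12)
baseline = majority_baseline_accuracy(y_demo)
print(f"Majority-class baseline: {baseline:.2f}")
```

If the model's accuracy is not clearly above this baseline, the per-class precision and recall in the classification report are the more informative numbers.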

---


## **Final Observations & Next Steps**

📌 **Professionals & older customers** subscribe more.  
📌 **Previous successful campaigns matter!**  
📌 **The model predicts with ~85% accuracy.**  

🔹 **Next Steps:**  
- Try a **Decision Tree or Random Forest** to see whether accuracy improves.  
- Perform **hyperparameter tuning** to improve predictions.  
- Test on new customer data.  
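
As a starting point for the first item, a Random Forest can be swapped in with very little code. The sketch below runs on synthetic data from `make_classification` rather than the bank dataset, so its score says nothing about the real problem. The names `X_demo`, `y_demo`, and `forest` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the encoded bank features (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=42
)

# stratify keeps the class ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)

acc = accuracy_score(y_te, forest.predict(X_te))
print("Random Forest accuracy on synthetic data:", acc)
```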