### 8. Encode Categorical Variables

Most machine learning models work with numbers, not text.
So categorical columns such as Sex (male/female) or Embarked (C/Q/S) must be converted to numeric columns before modeling.

After one-hot encoding, each row uses binary flags (0 or 1) to show its category:

- If Sex_male = 1, the passenger is male; if 0, female.
- If Embarked_Q = 1, the port is Queenstown.
- If Embarked_S = 1, the port is Southampton.
- If both Embarked_Q and Embarked_S are 0, the port is Cherbourg. Its column (Embarked_C) is dropped by drop_first=True because it can be inferred from the other two.

In [None]:
# dtype=int keeps the flags as 0/1 (recent pandas versions default to True/False)
df = pd.get_dummies(df, columns=["Sex", "Embarked"], drop_first=True, dtype=int)
df
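
To see what the encoding produces, here is a minimal sketch on a three-row toy DataFrame (made-up rows, not the Titanic data). The helper `decode_embarked` is a hypothetical name introduced only to show how the dropped category can be recovered from the remaining flags.

```python
import pandas as pd

# Toy data (made up for illustration), one row per port of embarkation.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["C", "Q", "S"],
})

# dtype=int forces 0/1 output instead of True/False.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True, dtype=int)
print(encoded)


def decode_embarked(row):
    """Recover the port: both flags at 0 means the dropped category, C."""
    if row["Embarked_Q"] == 1:
        return "Q"
    if row["Embarked_S"] == 1:
        return "S"
    return "C"


decoded = encoded.apply(decode_embarked, axis=1).tolist()
print(decoded)
```

The first category in sorted order (female, C) is the one that `drop_first=True` removes, which is why only Sex_male, Embarked_Q and Embarked_S remain.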

### 10. Correlation Heatmap

In [None]:
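# corr is a method and must be called; numeric_only=True skips any non-numeric columns
df.corr(numeric_only=True)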

|           | Survived | Age   | Fare  | Pclass | Sex\_male |
| --------- | -------- | ----- | ----- | ------ | --------- |
| Survived  | 1.00     | -0.08 | 0.26  | -0.34  | -0.54     |
| Age       | -0.08    | 1.00  | 0.09  | 0.36   | 0.09      |
| Fare      | 0.26     | 0.09  | 1.00  | -0.55  | -0.18     |
| Pclass    | -0.34    | 0.36  | -0.55 | 1.00   | 0.13      |
| Sex\_male | -0.54    | 0.09  | -0.18 | 0.13   | 1.00      |
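
The matrix above can be drawn as a heatmap. The sketch below is one possible way to do it with plain matplotlib; it rebuilds the matrix from the values shown in the table so that it runs on its own, whereas in the notebook you would pass the result of `df.corr(numeric_only=True)` instead. The color map and figure size are arbitrary choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

labels = ["Survived", "Age", "Fare", "Pclass", "Sex_male"]
# Values copied from the correlation table above.
corr = pd.DataFrame(
    [
        [1.00, -0.08, 0.26, -0.34, -0.54],
        [-0.08, 1.00, 0.09, 0.36, 0.09],
        [0.26, 0.09, 1.00, -0.55, -0.18],
        [-0.34, 0.36, -0.55, 1.00, 0.13],
        [-0.54, 0.09, -0.18, 0.13, 1.00],
    ],
    index=labels,
    columns=labels,
)

fig, ax = plt.subplots(figsize=(6, 5))
# Fix the color scale to [-1, 1] so that 0 sits in the middle of the color map.
im = ax.imshow(corr.values, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha="right")
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)

# Write each coefficient inside its cell.
for i in range(len(labels)):
    for j in range(len(labels)):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")

fig.colorbar(im, ax=ax)
ax.set_title("Correlation heatmap")
fig.tight_layout()
plt.show()
```

If seaborn is available, its `heatmap` function gives a similar picture with less code.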


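As a rough rule of thumb, judge strength by the absolute value of the coefficient. The sign only shows the direction: positive means both variables rise together, negative means one rises as the other falls.

| Absolute Value of r | Meaning              |
| ------------------- | -------------------- |
| `0.7 to 1.0`        | Strong correlation   |
| `0.4 to 0.7`        | Moderate correlation |
| `0.1 to 0.4`        | Weak correlation     |
| `0 to 0.1`          | Very weak or none    |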
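
The table can be turned into a small helper. This is a sketch under two assumptions that the table itself does not state: the ranges are applied to the absolute value of the coefficient, and a value sitting exactly on a boundary (such as 0.7) goes into the stronger band. The function name `correlation_strength` is made up for this example; the coefficients are the Survived row of the matrix above.

```python
def correlation_strength(r):
    """Label a correlation coefficient using the rule-of-thumb table."""
    magnitude = abs(r)  # the sign only gives the direction
    # Assumption: boundary values are assigned to the stronger band.
    if magnitude >= 0.7:
        return "Strong"
    if magnitude >= 0.4:
        return "Moderate"
    if magnitude >= 0.1:
        return "Weak"
    return "Very weak or none"


survived_row = [("Sex_male", -0.54), ("Pclass", -0.34), ("Fare", 0.26), ("Age", -0.08)]
for name, r in survived_row:
    print(f"Survived vs {name}: r = {r:+.2f} -> {correlation_strength(r)}")
```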


# Hypothesis Testing

Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data. It helps you answer questions like:

"Is there a real effect, or is it just due to random chance?"

- Null Hypothesis (H₀):
There is no effect, no difference, or no relationship.
Example: Gender does not affect survival.

- Alternative Hypothesis (H₁ or Ha):
There is an effect, a difference, or a relationship.
Example: Gender does affect survival.

The Process:

1. State the hypotheses (H₀ and H₁).

2. Choose a significance level α (commonly 0.05, meaning you accept a 5% chance of rejecting H₀ when it is actually true).

3. Collect and analyze data using a statistical test (like t-test, chi-square, etc.).

4. Calculate the p-value (the probability of seeing a result at least as extreme as the observed one, assuming H₀ is true).

5. Compare the p-value to your threshold (α):

   - If p ≤ α → Reject H₀ (the data would be unlikely under H₀, so there’s likely a real effect).
   - If p > α → Fail to reject H₀ (not enough evidence for a real effect; this does not prove H₀ is true).
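
The five steps can be sketched in code using the gender and survival example. The counts below are made up purely for illustration and are not taken from the Titanic data; `scipy.stats.chi2_contingency` is used because both variables are categorical.

```python
from scipy.stats import chi2_contingency

# Step 1: H0 = gender and survival are independent; H1 = they are related.
# Step 2: choose the significance level.
alpha = 0.05

# Step 3: made-up counts for illustration only
# (rows: female, male; columns: died, survived).
observed = [[20, 60],
            [90, 30]]

# Step 4: the test returns the statistic, the p-value, the degrees of
# freedom and the counts expected if H0 were true.
chi2, p_value, dof, expected = chi2_contingency(observed)

# Step 5: compare the p-value to alpha.
if p_value <= alpha:
    decision = "Reject H0"
else:
    decision = "Fail to reject H0"

print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}, dof = {dof} -> {decision}")
```

For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default, which makes the statistic slightly smaller than the uncorrected one.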



### Match Your Scenario to the Right Test

| **Scenario**                                                         | **Data Types**                       | **Use This Test**                           | **Purpose**                                  |
| -------------------------------------------------------------------- | ------------------------------------ | ------------------------------------------- | -------------------------------------------- |
| Compare two groups (e.g., Male vs Female Age)                        | Numerical vs Categorical (2 groups)  | **t-test**                                  | Checks if means are different                |
| Compare two groups (non-normal data)                                 | Numerical vs Categorical (2 groups)  | **Mann-Whitney U**                          | Compares rank distributions (non-parametric) |
| Compare more than two groups (e.g., Age by Pclass)                   | Numerical vs Categorical (3+ groups) | **ANOVA**                                   | Test mean differences across multiple groups |
| Compare two categorical variables (e.g., Gender vs Survived)         | Categorical vs Categorical           | **Chi-Square Test**                         | Checks if there is a relationship            |
| Correlation between two numerical variables (e.g., Age vs Fare)      | Numerical vs Numerical               | **Pearson Correlation**                     | Measures linear relationship                 |
| Predict one variable from another (e.g., Survived from Age & Pclass) | Mixed                                | **Logistic Regression / Linear Regression** | Predict outcome                              |
| Test if a sample mean differs from a known value                     | One Numerical Variable               | **One Sample t-test**                       | Compare to baseline (e.g., mean = 50)        |
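
As a rough guide to where these tests live in code, the sketch below calls the corresponding `scipy.stats` function for each row of the table except the regression row, which would normally need a modeling library. All data here is synthetic and generated only so that the block runs; the group names and parameters are assumptions, not Titanic values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # synthetic data, for illustration only

group_a = rng.normal(loc=30, scale=10, size=100)
group_b = rng.normal(loc=33, scale=10, size=100)
group_c = rng.normal(loc=36, scale=10, size=100)
# A second numeric variable that depends linearly on group_a plus noise.
related = 0.5 * group_a + rng.normal(loc=0, scale=5, size=100)

p_values = {
    # Two groups, comparing means.
    "t-test": stats.ttest_ind(group_a, group_b).pvalue,
    # Two groups, rank-based alternative.
    "Mann-Whitney U": stats.mannwhitneyu(group_a, group_b).pvalue,
    # Three or more groups.
    "ANOVA": stats.f_oneway(group_a, group_b, group_c).pvalue,
    # Two categorical variables, given as a table of counts.
    "Chi-square": stats.chi2_contingency([[20, 60], [90, 30]])[1],
    # Two numeric variables.
    "Pearson": stats.pearsonr(group_a, related)[1],
    # One sample against a known baseline of 50.
    "One-sample t-test": stats.ttest_1samp(group_a, popmean=50).pvalue,
}

for test_name, p in p_values.items():
    print(f"{test_name}: p = {p:.4g}")
```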


For the question “Does a passenger’s class (Pclass) affect their chance of survival?”, we use:

✅ Chi-Square Test of Independence

🔍 Why Chi-Square?

- Pclass is categorical (values: 1, 2, 3).
- Survived is also categorical (values: 0 or 1).
- We are checking whether there is a relationship between these two categorical variables.
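
A minimal sketch of this test starting from raw columns is shown below. The DataFrame `toy` is built from made-up counts and only stands in for the real data; in the notebook you would pass `df["Pclass"]` and `df["Survived"]` to `pd.crosstab` instead.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts per (Pclass, Survived) pair, for illustration only.
counts = {(1, 0): 40, (1, 1): 60, (2, 0): 50, (2, 1): 50, (3, 0): 75, (3, 1): 25}
rows = [pair for pair, n in counts.items() for _ in range(n)]
toy = pd.DataFrame(rows, columns=["Pclass", "Survived"])

# Contingency table: one row per class, one column per outcome.
table = pd.crosstab(toy["Pclass"], toy["Survived"])
print(table)

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}, dof = {dof}")

# Common rule of thumb: the test is reliable when every expected count is at least 5.
print("All expected counts >= 5:", bool((expected >= 5).all()))
```

With three classes and two outcomes, the degrees of freedom are (3 - 1) × (2 - 1) = 2.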

## Graph Selection Cheat Sheet

| **Graph Type**        | **Use When You Want To...**                                               | **Best For**                        |
| --------------------- | ------------------------------------------------------------------------- | ----------------------------------- |
| **Bar Chart**         | Compare values across categories                                          | Categorical comparisons             |
| **Line Chart**        | Show trends or changes over time                                          | Time-series data                    |
| **Histogram**         | Visualize the distribution of a single numeric variable                   | Distribution of continuous data     |
| **Box Plot**          | Compare distributions, spot outliers and medians                          | Statistical summaries per category  |
| **Pie Chart**         | Show parts of a whole (simple proportion data)                            | Proportions with few categories     |
| **Scatter Plot**      | Show relationships between two numeric variables                          | Correlation / relationship analysis |
| **Heatmap**           | Show correlation or data intensity in matrix form                         | Feature relationships, correlation  |
| **Violin Plot**       | Show data distribution + density across categories                        | Detailed distribution analysis      |
| **Area Chart**        | Show cumulative trends over time (like stacked line chart)                | Time-based part-to-whole insights   |
| **Stacked Bar Chart** | Show part-to-whole comparisons across categories                          | Multi-variable comparisons          |
| **Bubble Chart**      | Show relationship with 3 numeric variables (like scatter + size encoding) | Advanced comparisons                |
| **Pair Plot**         | Show relationships between all pairs of variables                         | Exploratory Data Analysis (EDA)     |
| **Choropleth Map**    | Visualize data across geographical regions                                | Geographic data analysis            |
| **Treemap**           | Visualize hierarchical data or proportions within categories              | Complex part-to-whole structures    |
