# Hypothesis Testing

One-sample t-test
- **t_stat, p_val = stats.ttest_1samp(df[col], popmean=pop_mean)**

One-sample z-test
- **z_stat, p_val = ztest(df[col], value=100)**

In [1]:
from scipy import stats
from statsmodels.stats.weightstats import ztest
from scipy.stats import chi2_contingency

# df: your DataFrame
# col: numeric column, e.g., "systolic_bp"
# pop_mean: the reference mean to test against

#t_stat, p_val = stats.ttest_1samp(df[col].dropna(), popmean=pop_mean)
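
A minimal runnable sketch of the two one-sample tests above. The data are simulated, and the column name `systolic_bp`, the reference mean of 120, and the 0.05 significance level are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.weightstats import ztest

# Hypothetical data: 200 simulated systolic blood pressure readings
rng = np.random.default_rng(0)
df = pd.DataFrame({"systolic_bp": rng.normal(loc=122, scale=10, size=200)})
col = "systolic_bp"
pop_mean = 120  # assumed reference mean

values = df[col].dropna().to_numpy()

# One-sample t-test: H0 is that the column mean equals pop_mean
t_stat, p_val_t = stats.ttest_1samp(values, popmean=pop_mean)

# One-sample z-test against the same reference value
z_stat, p_val_z = ztest(values, value=pop_mean)

alpha = 0.05  # assumed significance level
print(f"t = {t_stat:.3f}, p = {p_val_t:.4f}, reject H0: {p_val_t < alpha}")
print(f"z = {z_stat:.3f}, p = {p_val_z:.4f}, reject H0: {p_val_z < alpha}")
```

Both calls compute the same test statistic here; only the reference distribution (t versus normal) differs, so the p-values converge as the sample grows.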


Independent and Paired t-tests:
- **t_stat, p_val = stats.ttest_ind(g1_vals, g2_vals, equal_var=False)**
- **t_stat, p_val = stats.ttest_rel(paired[pre_col], paired[post_col])**

Split a numeric column into groups by filtering on a category: **group_a = df[df['Category'] == 'A']['Quantity']**

In [2]:
# group_col: categorical column with two groups, e.g., "treatment_group" ∈ {"A","B"}
# value_col: numeric outcome, e.g., "length_of_stay_days"
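
The cell above only names the inputs, so here is a hedged sketch of both tests on simulated data. The group labels, column names, and the `paired` frame with `pre` and `post` columns are hypothetical stand-ins for `group_col`, `value_col`, `pre_col`, and `post_col`.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical two-group data set
df = pd.DataFrame({
    "treatment_group": ["A"] * 50 + ["B"] * 50,
    "length_of_stay_days": np.concatenate([
        rng.normal(5.0, 1.5, 50),
        rng.normal(6.0, 1.5, 50),
    ]),
})
group_col, value_col = "treatment_group", "length_of_stay_days"

# Split the outcome by group
g1_vals = df.loc[df[group_col] == "A", value_col].dropna()
g2_vals = df.loc[df[group_col] == "B", value_col].dropna()

# Independent samples; equal_var=False selects Welch's t-test
t_ind, p_ind = stats.ttest_ind(g1_vals, g2_vals, equal_var=False)

# Hypothetical paired measurements on the same 40 subjects
pre = rng.normal(140, 12, 40)
paired = pd.DataFrame({"pre": pre, "post": pre - rng.normal(3, 5, 40)}).dropna()
t_rel, p_rel = stats.ttest_rel(paired["pre"], paired["post"])

print(f"independent: t = {t_ind:.3f}, p = {p_ind:.4f}")
print(f"paired:      t = {t_rel:.3f}, p = {p_rel:.4f}")
```

The paired test is equivalent to a one-sample t-test on the per-subject differences, which is why rows with a missing value in either column are dropped together.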

Chi-squared test
- **chi2, p, dof, expected = chi2_contingency(ct)**

In [3]:
#ct = pd.crosstab(df[row_cat], df[col_cat])
#chi2, p, dof, expected = chi2_contingency(ct)
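
A small worked sketch of the chi-squared test of independence. The `smoker` and `disease` columns and their counts are made up for illustration; they stand in for `row_cat` and `col_cat`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical categorical data: 100 subjects
df = pd.DataFrame({
    "smoker": ["yes"] * 30 + ["no"] * 70,
    "disease": ["yes"] * 20 + ["no"] * 10 + ["yes"] * 15 + ["no"] * 55,
})
row_cat, col_cat = "smoker", "disease"

# Contingency table of observed counts
ct = pd.crosstab(df[row_cat], df[col_cat])

# H0: the two categorical variables are independent
chi2, p, dof, expected = chi2_contingency(ct)

print(ct)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.5f}")
print("expected counts under H0:")
print(expected)
```

For a 2x2 table, `chi2_contingency` applies Yates' continuity correction by default; pass `correction=False` to turn it off.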

Z-tests
- **z_stat, p_val = ztest(df[value_col].dropna(), value=benchmark)**
- **z_stat, p_val = ztest(g1, g2)**

In [4]:
#z_stat, p_val = ztest(g1, g2)
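
A hedged sketch of both z-test forms on simulated samples. The arrays `g1` and `g2` and the `benchmark` value of 100 are assumptions for illustration.

```python
import numpy as np
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(2)

# Hypothetical samples, large enough for the normal approximation
g1 = rng.normal(100, 15, 300)
g2 = rng.normal(103, 15, 300)
benchmark = 100  # assumed reference value

# One-sample: H0 is mean(g1) == benchmark
z_one, p_one = ztest(g1, value=benchmark)

# Two-sample: H0 is mean(g1) - mean(g2) == 0
z_two, p_two = ztest(g1, g2)

print(f"one-sample: z = {z_one:.3f}, p = {p_one:.4f}")
print(f"two-sample: z = {z_two:.3f}, p = {p_two:.4f}")
```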

# Linear Regression

### 1. Import Libraries

In [5]:
# pip install statsmodels

import statsmodels.api as sm
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### 2. Import Data

```python 
df = pd.read_excel(file) # if excel file
df = pd.read_csv(file)   # if csv file
```

### 3. Data Cleaning

#### 3.1 Basic Overview

```python 
df.info()
```

#### 3.2 Datatypes
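
The notes give no example for this step, so the following is only a sketch of how datatypes might be inspected and converted. The frame and its `age`, `admitted`, and `ward` columns are hypothetical.

```python
import pandas as pd

# Hypothetical raw data where every column was read as text
df = pd.DataFrame({
    "age": ["34", "51", "n/a"],
    "admitted": ["2024-01-05", "2024-02-11", "2024-03-02"],
    "ward": ["A", "B", "A"],
})
print(df.dtypes)  # inspect the current datatypes

# Convert text to numbers; unparseable entries become NaN
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Convert text to datetime
df["admitted"] = pd.to_datetime(df["admitted"])

# Mark a low-cardinality text column as categorical
df["ward"] = df["ward"].astype("category")

print(df.dtypes)  # confirm the conversions
```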

#### 3.3 Missing values (Nulls)
```python 
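# Choose one strategy; dropna returns a new DataFrame, so assign the result
df = df.dropna()          # Remove rows with missing values
df = df.dropna(axis=1)    # Remove columns with missing values
df[col] = df[col].fillna(df[col].mean())  # Fill with the column mean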
```

#### 3.4 Outliers

```python 
df.describe()   # descriptive statistics
filtered_df = df[df[column] > value]   # Keep only rows where column exceeds value
```
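
The notes do not say how to choose the cutoff `value`. One common convention, used here as an assumption, is the 1.5 x IQR rule: values farther than 1.5 interquartile ranges from the quartiles are treated as outliers. The data and column name below are hypothetical.

```python
import pandas as pd

# Hypothetical data with one extreme value
df = pd.DataFrame({"length_of_stay_days": [2, 3, 3, 4, 4, 5, 5, 6, 40]})
column = "length_of_stay_days"

# Quartiles and interquartile range
q1 = df[column].quantile(0.25)
q3 = df[column].quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 x IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only rows inside the fences (bounds are inclusive)
filtered_df = df[df[column].between(lower, upper)]

print(f"kept {len(filtered_df)} of {len(df)} rows; fences = [{lower}, {upper}]")
```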

#### 3.5 Duplicates

```python
df = df.drop_duplicates()  # Drop duplicate rows
```

### 4. EDA

- **Drop nominal variables:** new_df = df.drop(list_of_columns, axis=1)

#### Convert Categorical data to numerical

```python

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Categories to numbers (automatic mapping: codes follow sorted class order)
le = LabelEncoder()
df[updated_col] = le.fit_transform(df[col])
mapping = dict(zip(le.classes_, le.transform(le.classes_)))

# Categories to numbers (manual mapping: codes follow the order given)
order = [['S', 'M', 'L', 'XL']]
ord_enc = OrdinalEncoder(categories=order)
df[updated_col] = ord_enc.fit_transform(df[[col]]).ravel()  # flatten (n, 1) output
mapping = {cat: i for i, cat in enumerate(order[0])}

# Create one indicator column per category (one-hot encoding)
new_df = pd.get_dummies(df, columns=list_of_columns)
```

#### Correlation

```python
corr = df.corr(numeric_only=True)  # skip non-numeric columns (they raise an error in pandas >= 2.0)
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation Heatmap")
plt.show()
```

- **Correlation**: which predictors are correlated with target?
- **Multicollinearity**: which predictors are correlated among themselves?
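
Both questions can be answered from the same correlation matrix. The sketch below uses simulated data, and the 0.8 threshold for flagging strongly correlated predictors is an assumed rule of thumb, not a value from these notes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200

# Hypothetical data: x2 nearly duplicates x1, x3 is unrelated noise
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
x3 = rng.normal(size=n)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3,
                   "y": 2 * x1 + rng.normal(scale=0.5, size=n)})
target = "y"

corr = df.corr(numeric_only=True)

# Correlation: predictors ranked by absolute correlation with the target
target_corr = corr[target].drop(target).sort_values(key=abs, ascending=False)
print(target_corr)

# Multicollinearity: predictor pairs whose |r| exceeds the threshold
pred_corr = corr.drop(index=target, columns=target)
threshold = 0.8  # assumed cutoff
high_pairs = [
    (a, b, round(pred_corr.loc[a, b], 2))
    for i, a in enumerate(pred_corr.columns)
    for b in pred_corr.columns[i + 1:]
    if abs(pred_corr.loc[a, b]) > threshold
]
print(high_pairs)
```

When a pair is flagged, a common response is to keep only one of the two predictors in the regression.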


### 5. Create input & output

```python
X = df.drop(target, axis=1)
y = df[[target]]
```

### 6. Develop the Regression Model

```python
X_reg = sm.add_constant(X) # adding a constant
reg = sm.OLS(y, X_reg).fit()
pred = reg.predict(X_reg)
reg.summary()
```