# Key Statistical Concepts

## Coefficient
### Definition
- A coefficient in regression analysis represents the expected change in the dependent variable for a one-unit change in the independent variable, holding all other variables constant.
- It indicates the strength and direction of the relationship between an independent variable and the dependent variable.

### Interpretation
- Positive coefficient: Indicates a direct relationship between the independent variable and the dependent variable.
- Negative coefficient: Indicates an inverse relationship between the independent variable and the dependent variable.
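- Magnitude: Indicates the size of the effect, expressed in units of the dependent variable per one unit of the independent variable. Magnitudes are therefore only directly comparable between variables measured on the same scale.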

```python
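import statsmodels.api as sm

# X (feature DataFrame) and Y (target Series) are assumed to be defined already.
# add_constant appends an intercept column so the model also estimates B0.
X = sm.add_constant(X)
model = sm.OLS(Y, X).fit()

# model.params holds the intercept and one coefficient per feature
coefficients = model.params
print(coefficients)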
```
\[ Y = B_0 + B_1 X_1 + B_2 X_2 + \dots + B_N X_N + E \]
- \( B_0 \): Intercept
- \( B_1, \dots, B_N \): Coefficients
- \( E \): Error term
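
To make the "holding all other variables constant" reading concrete, the sketch below evaluates the regression equation by hand. The intercept, the coefficient values, and the helper name `predict` are made up for illustration and do not come from any fitted model.

```python
def predict(intercept, coefficients, features):
    """Evaluate Y = B0 + B1*X1 + ... + BN*XN (the error term is omitted)."""
    return intercept + sum(b * x for b, x in zip(coefficients, features))

# Hypothetical values, chosen only for illustration
intercept = 50.0             # B0
coefficients = [3.0, -2.0]   # B1, B2

baseline = predict(intercept, coefficients, [10.0, 4.0])
# Increase X1 by one unit while holding X2 fixed
bumped = predict(intercept, coefficients, [11.0, 4.0])

# The difference equals B1, the coefficient of X1
print(bumped - baseline)
```

Changing only X1 by one unit shifts the prediction by exactly B1, which is what the coefficient is meant to express.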

## P-value
### Definition
- The p-value is the probability of obtaining test results at least as extreme as the observed results, assuming that the null hypothesis is true.
- It helps determine the statistical significance of the results.

### Interpretation
- Low p-value (< 0.05): The observed data would be unlikely if the null hypothesis were true, so you reject the null hypothesis.
- High p-value (> 0.05): The data are consistent with the null hypothesis, so you fail to reject it. This does not prove that the null hypothesis is true.
- 0.05: A conventional significance level (α), not a fixed rule. Choose the threshold before running the test.

```python
import statsmodels.api as sm
X = sm.add_constant(X)
model = sm.OLS(Y, X).fit()
p_values = model.pvalues
print(p_values)
```
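
As a rough illustration of "at least as extreme", the sketch below computes a two-sided p-value for a test statistic under a standard normal null distribution, using only the standard library. This is a simplifying assumption: OLS output such as `model.pvalues` is based on the t distribution, which the normal distribution approximates well only for large samples. The function name `two_sided_p_value` is hypothetical.

```python
import math

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    # erfc(a / sqrt(2)) equals 2 * (1 - Phi(a)) for a >= 0
    return math.erfc(abs(z) / math.sqrt(2))

for z in (0.5, 1.96, 3.0):
    print(f"z = {z}: p = {two_sided_p_value(z):.4f}")
```

Under this approximation a statistic of 1.96 corresponds to a p-value of roughly 0.05, which is where the familiar threshold comes from.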

## R² Value (Coefficient of Determination)

### Definition
- **R² (R-squared)** is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
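- For a least-squares model with an intercept, evaluated on the data it was fitted to, it ranges from 0 to 1. On held-out data it can be negative when the model predicts worse than simply using the mean.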

### Interpretation
- **0**: The independent variables do not explain any of the variability in the dependent variable.
- **1**: The independent variables explain all of the variability in the dependent variable.
- **Closer to 1**: Indicates a better fit of the model to the data.
- **Closer to 0**: Indicates a poor fit of the model to the data.

### Formula
\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \]
- \( SS_{res} \): Sum of squares of residuals (unexplained variation).
- \( SS_{tot} \): Total sum of squares (total variation in the data).
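
The formula can be checked by hand with a few lines of plain Python. The sketch below uses made-up toy values and a hypothetical helper named `r_squared`; the last case shows that the formula itself drops below 0 when predictions are worse than always predicting the mean.

```python
def r_squared(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

# Toy values, made up for illustration
y_true = [1.0, 2.0, 3.0, 4.0]
close_fit = [1.1, 1.9, 3.2, 3.8]
mean_only = [2.5, 2.5, 2.5, 2.5]   # always predicts the mean of y_true
poor_fit = [4.0, 3.0, 2.0, 1.0]    # predictions in reverse order

print(r_squared(y_true, close_fit))
print(r_squared(y_true, mean_only))
print(r_squared(y_true, poor_fit))
```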

### Example
```python
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
X = df[['highway-mpg']]
Y = df['price']
lm.fit(X, Y)
R_squared = lm.score(X, Y)
print(R_squared)
```

# Model Development Pipeline

## 1. Data Collection
- Gather data from various sources (databases, APIs, CSV files, etc.).

## 2. Data Preprocessing
- **Load Data**: Import data into your analysis environment (e.g., pandas DataFrame in Python).
- **Clean Data**: Handle missing values, remove duplicates, and correct data types.
- **Feature Engineering**: Create new features, normalize/standardize data, and encode categorical variables.

```python
import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Clean data
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)

# Feature engineering
df['new_feature'] = df['existing_feature'] ** 2
df = pd.get_dummies(df, columns=['categorical_feature'])

``` 
## 3. Exploratory Data Analysis (EDA)
- **Descriptive Statistics**: Summarize data using mean, median, mode, etc.
- **Visualizations**: Create plots (scatter plots, histograms, box plots) to understand data distribution and relationships.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Descriptive statistics
print(df.describe())

# Visualizations
sns.pairplot(df)
plt.show()

```

## 4. Feature Selection
- Identify and select important features using correlation analysis, variance threshold, or more advanced techniques like Recursive Feature Elimination (RFE).

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Feature selection using RFE
model = LinearRegression()
rfe = RFE(model, n_features_to_select=5)
fit = rfe.fit(X, Y)
selected_features = X.columns[fit.support_]

```
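
Correlation analysis, the simplest option mentioned above, can be sketched without any library: compute the Pearson correlation of each feature with the target and rank features by absolute value. The data and the feature names below are invented for illustration, and ranking by correlation is only one possible screening rule.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy data with hypothetical feature names
target = [1, 2, 3, 4, 5]
features = {
    "feature_a": [2, 4, 6, 8, 10],
    "feature_b": [5, 3, 4, 1, 2],
    "feature_c": [1, -1, 0, -1, 1],
}

correlations = {name: pearson(values, target) for name, values in features.items()}
# Rank by strength of the linear relationship, ignoring its sign
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
for name in ranked:
    print(f"{name}: r = {correlations[name]:+.2f}")
```

Note that a correlation near zero only rules out a linear relationship; a feature can still matter in a non-linear way.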
## 5. Model Development
- **Split Data**: Divide data into training and testing sets.
- **Train Model**: Fit the model to the training data.
- **Evaluate Model**: Use R² value, p-values, and coefficients to evaluate the model.

```python

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

# Split data
X = df[selected_features]
Y = df['target']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Train model
model = LinearRegression()
model.fit(X_train, Y_train)

# Evaluate model
R_squared = model.score(X_test, Y_test)
print(f"R² value: {R_squared}")

# Using statsmodels for p-values and coefficients
X_train_sm = sm.add_constant(X_train)
sm_model = sm.OLS(Y_train, X_train_sm).fit()
print(sm_model.summary())

```


## 6. Model Evaluation

- **Residual Analysis**: Check residual plots for homoscedasticity and patterns.
- **Performance Metrics**: Calculate metrics like RMSE, MAE, and adjusted R².
- **Validation**: Perform cross-validation to assess model generalization.


```python

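import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import cross_val_score

# Residual analysis: plot residuals against the fitted (predicted) values
Y_pred = model.predict(X_test)
residuals = Y_test - Y_pred
sns.scatterplot(x=Y_pred, y=residuals)
plt.axhline(0, color='gray', linestyle='--')  # reference line at zero residual
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()

# Performance metrics (R_squared is the test-set score from the previous step)
mse = mean_squared_error(Y_test, Y_pred)
rmse = mse ** 0.5  # avoids the squared=False argument, which newer scikit-learn releases deprecate
mae = mean_absolute_error(Y_test, Y_pred)
n_samples, n_features = X_test.shape
adjusted_R_squared = 1 - (1 - R_squared) * (n_samples - 1) / (n_samples - n_features - 1)
print(f"RMSE: {rmse}, MAE: {mae}, Adjusted R²: {adjusted_R_squared}")

# Cross-validation: mean R² over 5 folds of the full dataset
cv_scores = cross_val_score(model, X, Y, cv=5, scoring='r2')
print(f"Cross-validated R²: {cv_scores.mean()}")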

```
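
The adjusted R² line in the block above can be pulled out into a small helper to show how the penalty for extra predictors works. The function name and the numbers below are illustrative and not taken from a real model.

```python
def adjusted_r_squared(r_squared, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_features - 1)

# Same raw R^2 and sample size, different numbers of predictors (hypothetical values)
few = adjusted_r_squared(0.80, n_samples=101, n_features=5)
many = adjusted_r_squared(0.80, n_samples=101, n_features=50)

print(f"5 predictors:  {few:.3f}")
print(f"50 predictors: {many:.3f}")
```

With the raw R² held at 0.80, the adjusted value falls as predictors are added, which is why it is a fairer basis for comparing models of different sizes.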

# Fast Summary

## R² Value
- What it is: The R² value tells us how well our model explains the variation in the data. It ranges from 0 to 1.
- In simple terms: An R² close to 1 means the model accounts for most of the variation in the target variable (price). An R² close to 0 means it accounts for very little.
- Example: If we get an R² value of 0.8, it means 80% of the variability in car prices can be explained by the model using highway-mpg.

## p-value
- What it is: The p-value helps us determine if the results we see are statistically significant.
- In simple terms: A small p-value (typically less than 0.05) means that the relationship between the feature (e.g., highway-mpg) and the target variable (price) is unlikely to be due to chance alone. A large p-value means the data do not provide enough evidence that the feature matters.
- Example: If the p-value for highway-mpg is 0.01, a relationship this strong would appear only about 1% of the time if highway-mpg had no real effect, so highway-mpg is **likely a significant factor** in determining car prices.

## Coefficients
- What it is: Coefficients tell us how much the **target variable** (price) changes when the feature (e.g., highway-mpg) changes by one unit.
- In simple terms: If the coefficient is positive, it means that as the feature increases, the target variable also increases. If it’s negative, the target variable decreases.
- Example: If the coefficient for highway-mpg is -1000, it means that for each additional mile per gallon on the highway, the predicted price of the car decreases by 1000, holding any other features constant.

## Mean Squared Error (MSE)
- What it is: MSE measures the average squared difference between the actual prices and the prices predicted by the model.
- In simple terms: A smaller MSE means our predictions are closer to the actual prices, indicating a better model.
- Example: If the MSE is 1500, the average squared difference between actual and predicted prices is 1500, measured in squared dollars. Taking the square root gives the RMSE, about $38.73, which is the typical prediction error in dollars.
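
Because MSE is expressed in squared units, it is often reported alongside its square root (RMSE), which is in the same units as the price. A minimal sketch with invented prices:

```python
import math

# Toy prices in dollars, made up for illustration
actual = [10000, 12000, 15000]
predicted = [10500, 11500, 15000]

squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
mse = sum(squared_errors) / len(squared_errors)  # units: dollars squared
rmse = math.sqrt(mse)                            # units: dollars

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```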