#### ✅ What is Statsmodels?
Statsmodels is a Python library for statistical modeling, hypothesis testing, and data exploration. It is especially useful for:

- Linear regression
- Time series analysis
- ANOVA tests
- Generalized linear models (GLMs)
- Statistical tests



##### Main Uses of Statsmodels
| Area | Use Case |
|---|---|
| Regression | Linear and logistic regression |
| Time Series | ARIMA, SARIMA models |
| Hypothesis Testing | T-tests, Chi-square tests |
| Statistical Modeling | GLMs, ANOVA |
| Diagnostics | Residual plots, p-values, confidence intervals |

#### Common Functions and Modules in Statsmodels

| Function / Class | Description |
|---|---|
| `OLS()` | Ordinary least squares regression |
| `Logit()` | Logistic regression |
| `GLM()` | Generalized linear models |
| `anova_lm()` | ANOVA table for one or more fitted linear models |
| `t_test()`, `f_test()` | Hypothesis tests on coefficients (methods of a fitted results object) |
| `ARIMA()` | ARIMA models for time series |

In [5]:
### Example 1: Linear Regression (OLS)

import statsmodels.api as sm
import numpy as np

# sample data
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# add a constant column so the model estimates an intercept
X = sm.add_constant(X)
# fit the model by ordinary least squares
model = sm.OLS(y, X).fit()

# summary of the regression
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.600
Model:                            OLS   Adj. R-squared:                  0.467
Method:                 Least Squares   F-statistic:                     4.500
Date:                Thu, 10 Apr 2025   Prob (F-statistic):              0.124
Time:                        15:58:50   Log-Likelihood:                -5.2598
No. Observations:                   5   AIC:                             14.52
Df Residuals:                       3   BIC:                             13.74
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.2000      0.938      2.345      0.1

  warn("omni_normtest is not valid with less than 8 observations; %i "


In [9]:
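#### Example 2: Logistic Regression
import statsmodels.api as sm
import numpy as np

# Data
# Note: these classes are perfectly separable (y is 1 exactly when x >= 3),
# so no finite maximum-likelihood estimate exists and the fit does not converge.
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([0, 0, 1, 1, 1])

# add a constant column so the model estimates an intercept
X = sm.add_constant(X)

# fit logistic model
model = sm.Logit(y, X).fit()
print(model.summary())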

         Current function value: 0.000000
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                    5
Model:                          Logit   Df Residuals:                        3
Method:                           MLE   Df Model:                            1
Date:                Thu, 10 Apr 2025   Pseudo R-squ.:                   1.000
Time:                        16:01:34   Log-Likelihood:            -5.0138e-10
converged:                      False   LL-Null:                       -3.3651
Covariance Type:            nonrobust   LLR p-value:                  0.009480
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const       -110.4353   2.23e+05     -0.000      1.000   -4.38e+05    4.38e+05
x1            44.2438   9.07e+04      0.000      1.000   -1.78e+0



In [11]:
## Example 3: Time Series - ARIMA

import statsmodels.api as sm

# load the AirPassengers dataset
# (fetched from the Rdatasets collection, so this needs internet access)
data = sm.datasets.get_rdataset("AirPassengers").data["value"]

# fit an ARIMA model with order (p=1, d=1, q=1)
model = sm.tsa.ARIMA(data, order=(1, 1, 1)).fit()
print(model.summary())

# forecast the next 5 observations
forecast = model.forecast(steps=5)
print(forecast)

                               SARIMAX Results                                
Dep. Variable:                  value   No. Observations:                  144
Model:                 ARIMA(1, 1, 1)   Log Likelihood                -694.341
Date:                Thu, 10 Apr 2025   AIC                           1394.683
Time:                        16:04:03   BIC                           1403.571
Sample:                             0   HQIC                          1398.294
                                - 144                                         
Covariance Type:                  opg                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1         -0.4742      0.123     -3.847      0.000      -0.716      -0.233
ma.L1          0.8635      0.078     11.051      0.000       0.710       1.017
sigma2       961.9270    107.433      8.954      0.0

In [13]:
## Example 4: Hypothesis Testing

from statsmodels.stats.weightstats import ttest_ind

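# two independent samples
data1 = [22, 25, 29, 31, 26]
data2 = [28, 27, 31, 30, 32]

# two-sided, two-sample t-test; pooled (equal) variance is the default
# returns the t statistic, the p-value, and the degrees of freedom
t_stat, p_val, df = ttest_ind(data1, data2)
print("t-statistic:", t_stat)
print("p-value:", p_val)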


t-statistic: -1.6464638998453551
p-value: 0.13828488323742122
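
By default `ttest_ind` pools the two sample variances and runs a two-sided test. The sketch below reuses the same two samples to show the `usevar` and `alternative` parameters; which variant is appropriate depends on the data and the question, so treat this as a tour of the options rather than a recommendation.

```python
from statsmodels.stats.weightstats import ttest_ind

data1 = [22, 25, 29, 31, 26]
data2 = [28, 27, 31, 30, 32]

# Default behaviour, written out explicitly: pooled variance, two-sided
t_pooled, p_pooled, df_pooled = ttest_ind(data1, data2, usevar="pooled")

# Welch's t-test: does not assume equal variances
t_welch, p_welch, df_welch = ttest_ind(data1, data2, usevar="unequal")

# One-sided test: the alternative is that mean(data1) < mean(data2)
t_one, p_one, df_one = ttest_ind(data1, data2, alternative="smaller")

print("pooled:   ", t_pooled, p_pooled, df_pooled)
print("Welch:    ", t_welch, p_welch, df_welch)
print("one-sided:", t_one, p_one, df_one)
```

Because both samples have the same size here, the pooled and Welch t statistics coincide; only the degrees of freedom, and therefore the p-values, differ.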


#### ❌ Drawbacks of Statsmodels

| Limitation | Explanation |
|---|---|
| 🧠 Less Beginner-Friendly | The API and output use statistical terminology and assume some statistics background |
| 🧮 Manual Data Prep | The array interface requires explicit steps such as adding a constant with `add_constant()` |
| 📉 Less Support for Large ML | Not designed for deep learning or other complex machine learning models |
| 📦 Limited Visualization | Built-in plots are mainly diagnostic and rely on Matplotlib; general-purpose charts are better served by Seaborn or Matplotlib directly |
| ⏱️ Slower for Large Datasets | Not optimized for massive-scale data the way Scikit-learn or TensorFlow are |
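
The manual step of adding a constant applies to the array interface used in the examples above. A minimal sketch of the formula interface, which includes an intercept automatically, fitted to the same data as Example 1 (the column names `x` and `y` are chosen here for illustration):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same five points as the OLS example, as a DataFrame
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 5]})

# "y ~ x" means: regress y on x; the intercept is included implicitly
result = smf.ols("y ~ x", data=df).fit()

# Coefficients are labelled by name instead of by position
print(result.params)
```

The formula interface needs the data in a pandas DataFrame, which is a reasonable trade for labelled output and no `add_constant()` call.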

#### ✅ When to Use Statsmodels vs. Scikit-learn
| Task | Statsmodels | Scikit-learn |
|---|---|---|
| Statistical analysis | ✅ | ❌ |
| P-values, confidence intervals | ✅ | ❌ |
| Machine learning models | ❌ | ✅ |
| Cross-validation, pipelines, grid search | ❌ | ✅ |

#### Summary
| Feature | Statsmodels |
|---|---|
| Ideal For | Regression, statistical tests, time series |
| Strength | Detailed model output (p-values, R², confidence intervals) |
| Weakness | Limited machine learning support, no deep learning, less automation |
| Popular Functions | `OLS()`, `Logit()`, `anova_lm()`, `ARIMA()` |