# Modules for Analysis

## Scipy {.unnumbered}

Scipy is a library for scientific and technical computing, built on top of Numpy. It provides routines for numerical integration, optimization, interpolation, linear algebra, signal processing, image processing, and more.<br>
More information is available on the [Scipy website](https://www.scipy.org/).

Some of the subpackages are `scipy.cluster`, `scipy.constants`, `scipy.fft` (which supersedes the legacy `scipy.fftpack`), `scipy.integrate`, `scipy.interpolate`, `scipy.io`, `scipy.linalg`, `scipy.ndimage`, `scipy.odr`, `scipy.optimize`, `scipy.signal`, `scipy.sparse`, `scipy.spatial`, `scipy.special`, and `scipy.stats`. The former `scipy.weave` subpackage has been removed from current releases.

You can use the `scipy.stats` subpackage for statistical analysis. It provides descriptive statistics such as the mean, mode, standard deviation, variance, skewness, and kurtosis. It also provides hypothesis tests, such as the t-test, the chi-square test, and the F-test, as well as simple statistical modeling such as linear regression.

In [113]:
# example of scipy.stats
import scipy.stats as stats

# create a normal distributed random variable
rvs = stats.norm.rvs(size=1000)
print("statistical summary: \n", stats.describe(rvs))


statistical summary: 
 DescribeResult(nobs=1000, minmax=(-3.1200489778662845, 3.4799474574915226), mean=0.03012725467181336, variance=0.9874106047908483, skewness=0.16006045274259637, kurtosis=0.17838129320176854)


In [114]:
import numpy as np
from scipy.stats import linregress

# linear regression on synthetic data: y = 8*x plus Gaussian noise

x = np.linspace(0, 10, 100) 
y = 8*x + 10*np.random.normal(0, 1, 100)
res = linregress(x, y)
print("slope: ", res.slope, "+/-", res.stderr)
print("intercept: ", res.intercept, "+/-", res.intercept_stderr)
print("rvalue: ", res.rvalue)
print("pvalue: ", res.pvalue)
from scipy.stats import t
# Two-sided inverse Student's t-distribution
# p probability, df degrees of freedom
tinv = lambda p, df: abs(t.ppf(p/2, df))
# 95% confidence interval
ts = tinv(0.05, len(x)-2)
print(f"intercept (95%): {res.intercept:.6f}" f" +/- {ts*res.intercept_stderr:.6f}")

slope:  8.244088114253186 +/- 0.3297107686445956
intercept:  -1.4285450637070412 +/- 1.9083869914860114
rvalue:  0.9297801648805708
pvalue:  2.526825425057965e-44
intercept (95%): -1.428545 +/- 3.787132
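
The hypothesis tests mentioned above can be sketched in the same way. The following is a minimal example on synthetic data; the sample sizes, means, and seed are assumptions chosen only for illustration. It runs a one-sample t-test and a two-sample (Welch) t-test.

```python
import numpy as np
from scipy import stats

# Synthetic data: two normal samples with different true means (assumed values)
rng = np.random.default_rng(seed=0)
sample_a = rng.normal(loc=0.0, scale=1.0, size=200)
sample_b = rng.normal(loc=0.5, scale=1.0, size=200)

# One-sample t-test: is the mean of sample_a compatible with 0?
res_one = stats.ttest_1samp(sample_a, popmean=0.0)

# Two-sample t-test (Welch's variant, which does not assume equal variances)
res_two = stats.ttest_ind(sample_a, sample_b, equal_var=False)

print("one-sample: t =", res_one.statistic, " p =", res_one.pvalue)
print("two-sample: t =", res_two.statistic, " p =", res_two.pvalue)
```

A small p-value in the second test indicates that the two sample means differ by more than the noise would explain.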


You can use the `scipy.optimize` subpackage for optimization. It can minimize a function (to maximize a function, minimize its negative) and fit models to data with least squares, including non-linear curve fitting.

In [115]:
import scipy.optimize as opt

# define a function
def linear_func(x, m, b):
    return m*x + b

# generate some data
x = np.linspace(0, 10, 100) 
y = 8*x + 10*np.random.normal(0, 1, 100)

# fit the data with non-linear least squares
# pop holds the best-fit parameters (m, b), pcov their covariance matrix
pop, pcov = opt.curve_fit(linear_func, x, y)
print("pop: ", pop)
print("pcov: ", pcov)
print("perr: ", np.sqrt(np.diag(pcov)))
print("m: ", pop[0] , " +/- ", np.sqrt(pcov[0,0]))
print("b: ", pop[1] , " +/- ", np.sqrt(pcov[1,1]))


pop:  [ 8.08884688 -1.63434236]
pcov:  [[ 0.12089433 -0.60447165]
 [-0.60447165  4.05016356]]
perr:  [0.34769862 2.01250182]
m:  8.08884687649677  +/-  0.34769862200054785
b:  -1.6343423576530771  +/-  2.0125018162642294
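
Besides curve fitting, general minimization can be sketched with `scipy.optimize.minimize`. The objective functions below are hypothetical and chosen so that the answers are known in advance; maximization is done by minimizing the negative of the function.

```python
import numpy as np
from scipy.optimize import minimize

# Quadratic bowl with its minimum at (1, -2); chosen only for illustration
def objective(p):
    x, y = p
    return (x - 1.0)**2 + (y + 2.0)**2

# Start the search from an arbitrary initial guess
result = minimize(objective, x0=np.array([0.0, 0.0]))
print("success: ", result.success)
print("minimum at: ", result.x)
print("function value: ", result.fun)

# Hypothetical function with its maximum at x = 3
def g(p):
    return -(p[0] - 3.0)**2 + 5.0

# To maximize g, minimize -g
res_max = minimize(lambda p: -g(p), x0=np.array([0.0]))
print("maximum at: ", res_max.x, " value: ", g(res_max.x))
```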


You can use the `scipy.interpolate` subpackage for interpolation of 1D, 2D, and 3D data, including spline fitting.
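
As a minimal sketch of 1D interpolation, the example below samples a sine curve coarsely and compares linear interpolation with a cubic spline. The sine function and the number of points are assumptions, used only so that the interpolation error can be checked against the true curve.

```python
import numpy as np
from scipy.interpolate import CubicSpline, interp1d

# Coarse samples of a known function
x_coarse = np.linspace(0, 2 * np.pi, 10)
y_coarse = np.sin(x_coarse)

# Build two interpolants from the same samples
linear = interp1d(x_coarse, y_coarse, kind="linear")
spline = CubicSpline(x_coarse, y_coarse)

# Evaluate both on a fine grid and compare with the true function
x_fine = np.linspace(0, 2 * np.pi, 200)
err_linear = np.max(np.abs(linear(x_fine) - np.sin(x_fine)))
err_spline = np.max(np.abs(spline(x_fine) - np.sin(x_fine)))

print("max error (linear): ", err_linear)
print("max error (cubic spline): ", err_spline)
```

For smooth data like this, the cubic spline should follow the true curve more closely than the piecewise-linear interpolant.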

You can use the `scipy.linalg` subpackage to perform linear algebra operations. You can calculate the determinant, inverse, eigenvalues, eigenvectors, etc. You can also solve linear systems of equations, etc.

In [116]:
from scipy.linalg import solve
# solve the linear equation
A = np.array([[1, 2], [3, 4]])
b = np.array([1, 2])
x = solve(A, b)
print("x: ", x)
print("A*x: ", np.dot(A, x))
print("b: ", b)
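# the remaining quantities are computed with numpy.linalg;
# scipy.linalg provides functions with the same names (eigvals, eig, det, inv)
print("Eigenvalues: ", np.linalg.eigvals(A))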
print("Eigenvectors: ", np.linalg.eig(A))
print("Determinant: ", np.linalg.det(A))
print("Inverse: ", np.linalg.inv(A))



x:  [0.  0.5]
A*x:  [1. 2.]
b:  [1 2]
Eigenvalues:  [-0.37228132  5.37228132]
Eigenvectors:  EigResult(eigenvalues=array([-0.37228132,  5.37228132]), eigenvectors=array([[-0.82456484, -0.41597356],
       [ 0.56576746, -0.90937671]]))
Determinant:  -2.0000000000000004
Inverse:  [[-2.   1. ]
 [ 1.5 -0.5]]
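
The same quantities can also be computed with `scipy.linalg` alone. This is a minimal sketch using the same matrix; note that `scipy.linalg.eig` returns the eigenvalues as a complex array, even when they are real.

```python
import numpy as np
from scipy import linalg

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, 2.0])

# Solve A x = b
x = linalg.solve(A, b)

# Eigenvalues (complex dtype) and eigenvectors (stored as columns)
eigenvalues, eigenvectors = linalg.eig(A)

determinant = linalg.det(A)
inverse = linalg.inv(A)

print("x: ", x)
print("Eigenvalues: ", eigenvalues)
print("Eigenvectors: ", eigenvectors)
print("Determinant: ", determinant)
print("Inverse: ", inverse)
```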


You can use the `scipy.constants` subpackage to access physical and mathematical constants.

In [117]:
import scipy.constants as const
print("speed of light: ", const.c)
print("Planck constant: ", const.h)
print("Boltzmann constant: ", const.k)
print("Avogadro constant: ", const.N_A)

speed of light:  299792458.0
Planck constant:  6.62607015e-34
Boltzmann constant:  1.380649e-23
Avogadro constant:  6.02214076e+23


In [118]:
from scipy.constants import physical_constants
print("speed of light: ", physical_constants["speed of light in vacuum"])

speed of light:  (299792458.0, 'm s^-1', 0.0)
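
Each entry of `physical_constants` is a tuple of value, unit, and uncertainty. The sketch below unpacks one entry and uses the constants in a short calculation, the energy of a photon, E = h c / λ. The wavelength of 500 nm is an assumed example value.

```python
import scipy.constants as const
from scipy.constants import physical_constants

# Unpack a tabulated constant into value, unit, and uncertainty
value, unit, uncertainty = physical_constants["electron mass"]
print("electron mass: ", value, unit, "+/-", uncertainty)

# Photon energy for an assumed wavelength of 500 nm
wavelength = 500 * const.nano            # SI prefix: 1e-9, so this is in metres
energy_joule = const.h * const.c / wavelength
energy_ev = energy_joule / const.eV      # convert joules to electron volts

print("photon energy (J): ", energy_joule)
print("photon energy (eV): ", energy_ev)
```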


## Statsmodels {.unnumbered}

Statsmodels is a library for statistical modeling and testing, built on top of Numpy and Scipy. It provides tools for statistical modeling, hypothesis testing, and time series analysis.<br>

For more information, see the [Statsmodels documentation](https://www.statsmodels.org/stable/index.html).

In [119]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

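If this import raises `ModuleNotFoundError: No module named 'statsmodels'`, the package is not installed in the current environment. Install it (for example with `pip install statsmodels`) before running the cells below.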

### Linear regression {.unnumbered}

In [None]:
import numpy as np
import statsmodels.api as sm

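# Download the mtcars dataset from the Rdatasets collection (needs internet access)
data = sm.datasets.get_rdataset("mtcars").data
# Add a column of ones named "const" so the model includes an intercept
data = sm.add_constant(data)
# Regress fuel economy (mpg) on car weight (wt)
model = sm.OLS(data["mpg"], data[["const", "wt"]])
results = model.fit()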
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                    mpg   R-squared:                       0.753
Model:                            OLS   Adj. R-squared:                  0.745
Method:                 Least Squares   F-statistic:                     91.38
Date:                Fri, 01 Mar 2024   Prob (F-statistic):           1.29e-10
Time:                        00:03:06   Log-Likelihood:                -80.015
No. Observations:                  32   AIC:                             164.0
Df Residuals:                      30   BIC:                             167.0
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         37.2851      1.878     19.858      0.0

## Scikit-learn {.unnumbered}

`Scikit-learn` is a library for machine learning, built on top of Numpy and Scipy. It provides methods for supervised learning, unsupervised learning such as clustering, dimensionality reduction, and more.<br>

For more information, see the [Scikit-learn documentation](https://scikit-learn.org/stable/).

In [None]:
import sklearn as sk

Examples of supervised learning methods are linear regression, logistic regression, decision trees, random forests, support vector machines, and k-nearest neighbors.<br>

### Linear regression {.unnumbered}

In [None]:
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature (column 2, the body mass index), kept as a 2D column
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f'
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f'
      % r2_score(diabetes_y_test, diabetes_y_pred))



Coefficients: 
 [938.23786125]
Mean squared error: 2548.07
Coefficient of determination: 0.47
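
The manual slicing into training and testing sets can also be done with `train_test_split`, which shuffles the data before splitting. This is a minimal sketch of that alternative; the test fraction and the `random_state` value are assumptions, so the resulting score will differ from the output above.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load features and targets; keep only column 2 as a 2D array
X, y = load_diabetes(return_X_y=True)
X_bmi = X[:, [2]]

# Shuffle and split: 80% training, 20% testing (assumed split)
X_train, X_test, y_train, y_test = train_test_split(
    X_bmi, y, test_size=0.2, random_state=0
)

# Fit on the training set and evaluate on the held-out test set
model = LinearRegression().fit(X_train, y_train)
score = r2_score(y_test, model.predict(X_test))

print("Coefficients: ", model.coef_)
print("Coefficient of determination: %.2f" % score)
```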
