# Important Python libraries

## 1. SciPy
- A fundamental Python library for scientific and technical computing, built on **NumPy**
- Provides a large number of **higher-level functions that operate on NumPy arrays**
- Organized into submodules, each covering one area of scientific computing:

1. **Integration (scipy.integrate)**: Provides functions for integrating functions and solving differential equations.

2. **Optimization (scipy.optimize)**: Offers algorithms for function minimization, root finding, and curve fitting.

3. **Interpolation (scipy.interpolate)**: Allows for smooth interpolation of data points with various methods.

4. **Fourier Transforms (scipy.fft)**: Contains functions for computing fast Fourier transforms. The older scipy.fftpack module is considered legacy; scipy.fft is recommended for new code.

5. **Signal Processing (scipy.signal)**: Includes tools for signal processing: filtering, windowing, signal generation, etc.

6. **Linear Algebra (scipy.linalg)**: Provides more advanced linear algebra routines beyond those in numpy.linalg.

7. **Sparse Matrices (scipy.sparse)**: Includes tools for working with sparse matrices.

8. **Statistics (scipy.stats)**: Contains a large number of probability distributions, statistical functions, and tests.

9. **Multidimensional Image Processing (scipy.ndimage)**: Offers various functions for multi-dimensional image processing.

10. **Special Functions (scipy.special)**: Gives access to numerous mathematical functions like Bessel, Gamma, Beta, hypergeometric, etc.
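As a quick illustration of two of the submodules listed above, the following minimal sketch integrates a function numerically and interpolates between sample points. The sample data (`x_known`, `y_known`) is made up for the example, and `CubicSpline` is only one of several interpolators in `scipy.interpolate`.

```python
import numpy as np
from scipy import integrate, interpolate

# Numerical integration: integrate sin(x) over [0, pi]; the exact answer is 2.
area, abs_error = integrate.quad(np.sin, 0, np.pi)
print("Integral of sin on [0, pi]:", area)

# Interpolation: sample y = x**2 at integer points (illustrative data),
# then estimate the value between two samples with a cubic spline.
x_known = np.arange(0, 6)
y_known = x_known ** 2
spline = interpolate.CubicSpline(x_known, y_known)
estimate = float(spline(2.5))
print("Spline estimate at x = 2.5:", estimate)
```

`quad` returns both the estimate and an upper bound on its absolute error, which is useful for checking whether the result can be trusted.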

In [2]:
# Example: find the minimum of a simple function

from scipy.optimize import minimize

# Define the objective function
def objective_function(x):
    return (x - 2)**2 + 5

# Initial guess for the minimum (starting point for the optimization)
initial_guess = 0

# Use the minimize function to find the minimum
result = minimize(objective_function, initial_guess)

# Extract the optimal value of x
optimal_x = result.x[0]

# Print the result
print("Optimal value of x:", optimal_x)
print("Minimum value of the function:", result.fun)



Optimal value of x: 2.00000001888464
Minimum value of the function: 5.0
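Besides minimization, `scipy.optimize` also covers root finding and curve fitting. The sketch below is one possible way to use `root_scalar` and `curve_fit`; the function `line` and the noise-free sample data are assumptions made up for this illustration.

```python
import numpy as np
from scipy.optimize import curve_fit, root_scalar

# Root finding: solve x**3 - 2 = 0 inside the bracket [0, 2].
root_result = root_scalar(lambda x: x**3 - 2, bracket=[0, 2], method="brentq")
print("Root:", root_result.root)

# Curve fitting: recover slope and intercept from samples of y = 3x + 1.
def line(x, slope, intercept):
    """Straight-line model used for the fit (illustrative)."""
    return slope * x + intercept

x_data = np.linspace(0, 10, 20)
y_data = line(x_data, 3.0, 1.0)  # noise-free, made-up data

params, covariance = curve_fit(line, x_data, y_data)
print("Fitted slope and intercept:", params)
```

With real, noisy measurements the fitted parameters would only approximate the true values, and the diagonal of `covariance` gives a rough estimate of their variances.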


## 2. pandas

- pandas is a powerful and flexible data analysis and manipulation library for Python.

- It is an essential tool in the Python data science toolkit, often used in conjunction with libraries like NumPy, Matplotlib, and SciPy. Its key features include:

    - DataFrame Object: A primary data structure of pandas, DataFrame, is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

    - Series Object: A one-dimensional labeled array capable of holding any data type.

    - Handling Different Data Types: Pandas can easily handle a variety of data types, including floating point, integer, boolean, categorical, datetime, and more.

    - Data Alignment and Missing Data Handling: It has built-in support for automatically aligning data from different sources on their labels and for handling missing data.

    - File I/O: Reads and writes data in many file formats, such as CSV, Excel, JSON, HTML, and HDF5.

    - Data Cleaning and Transformation: Offers extensive functions for cleaning, transforming, and reshaping data.

    - Merging and Joining: Provides SQL-like operations for merging or joining data sets.

    - Grouping and Aggregation: Powerful grouping and aggregation capabilities for data summarization.

    - Time Series Analysis: Built-in support for handling and analyzing time-series data.

    - Visualization: Thin wrappers around Matplotlib (for example `DataFrame.plot`) for quick and easy data visualization.
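To make the Series, alignment, and missing-data points above more concrete, here is a minimal sketch. The labels and values are invented for illustration only.

```python
import pandas as pd

# Two Series whose index labels only partially overlap (made-up data).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Arithmetic aligns on index labels, not on position.
# Labels present in only one Series produce NaN (missing) in the result.
total = s1 + s2
print(total)

# Two common ways to handle the resulting missing values.
dropped = total.dropna()     # discard entries that are missing
filled = total.fillna(0.0)   # or replace them with a default value
print(dropped)
print(filled)
```

Whether to drop or fill missing values depends on the analysis; neither is a universally correct default.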

In [3]:
import pandas as pd

#create a dataframe using pandas

data = {'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'San Francisco', 'Los Angeles']
}

df = pd.DataFrame(data)
print(df)

# A pandas DataFrame is similar to a data frame in R

# Perform a basic operation: the mean of the 'Age' column
print(df['Age'].mean())

      Name  Age           City
0    Alice   25       New York
1      Bob   30  San Francisco
2  Charlie   35    Los Angeles
30.0
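The merging and grouping features mentioned earlier can be sketched in the same spirit. The extra row for "Dana" and the `cities` lookup table are hypothetical additions made up for this example; they are not part of the data above.

```python
import pandas as pd

# People table, extended with one hypothetical extra row (Dana).
people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 30, 35, 45],
    "City": ["New York", "San Francisco", "Los Angeles", "New York"],
})

# Hypothetical lookup table mapping each city to its state.
cities = pd.DataFrame({
    "City": ["New York", "San Francisco", "Los Angeles"],
    "State": ["NY", "CA", "CA"],
})

# SQL-like left join on the shared 'City' column.
merged = people.merge(cities, on="City", how="left")

# Group by state and compute the mean age within each group.
mean_age_by_state = merged.groupby("State")["Age"].mean()
print(mean_age_by_state)
```

A left join keeps every row of `people` even if its city has no match in `cities`; in that case the `State` value would be missing.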




### Categorical data

- In the context of data analysis and statistics, "categorical" refers to a type of data that can take on one of a limited, and usually fixed, number of possible values. These values represent discrete categories or groups. Categorical data is often used to represent characteristics such as a person's gender, a car's color, or a student's grade level.

- There are two main types of categorical data:

    - Nominal Data: This is categorical data where the order of the categories is not significant. For example, the colors of cars (red, blue, green) are nominal because the order in which you list these colors does not carry any intrinsic meaning.

    - Ordinal Data: This type of categorical data involves some order; the categories have a logical sequence. For example, ratings such as "good", "better", "best" have an inherent order. The order is significant and meaningful.

- In data analysis, especially with libraries like Pandas in Python, it's often important to specify when data is categorical because it informs the analysis and visualization techniques you might use. For example, certain statistical models or plots are more appropriate for categorical data.

- In Pandas, you can specify that a column in a DataFrame is categorical by converting it to the category dtype.

  - This conversion can reduce memory use and speed up some operations, especially for large datasets with relatively few distinct values, because pandas stores each value as a small integer code that refers to a table of categories.

In [4]:
import pandas as pd

df = pd.DataFrame({'grade': ['A', 'B', 'C', 'D', 'F']})
df['grade'] = df['grade'].astype('category')
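The conversion above produces an unordered (nominal) categorical. For ordinal data, the order can be declared explicitly. The following is a minimal sketch using the "good", "better", "best" ratings from the text; the sample values in `ratings` are made up for illustration.

```python
import pandas as pd

# Made-up ratings using the ordinal scale good < better < best.
ratings = pd.Series(["good", "best", "better", "good", "best"])

# Declare the categories and their order explicitly.
rating_dtype = pd.CategoricalDtype(
    categories=["good", "better", "best"], ordered=True
)
ratings = ratings.astype(rating_dtype)

# Each value is stored as an integer code that indexes into the categories.
codes = ratings.cat.codes.tolist()
print(codes)

# Because the categorical is ordered, comparisons and sorting follow the
# declared order rather than alphabetical order.
above_good = (ratings > "good").tolist()
sorted_ratings = ratings.sort_values().tolist()
print(ratings.min(), ratings.max())
print(above_good)
print(sorted_ratings)
```

Without `ordered=True`, comparisons such as `ratings > "good"` raise an error, which helps catch accidental ordering of nominal data.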

## 3. statsmodels

- Linear Models: statsmodels offers various options for linear regression, including Ordinary Least Squares (OLS), Generalized Least Squares (GLS), and robust linear models.

- Generalized Linear Models: For response variables that are not normally distributed, statsmodels includes Generalized Linear Models (GLM), such as logistic regression for binary outcomes.

- Time Series Analysis: It has tools for the estimation of time series models, including ARIMA (AutoRegressive Integrated Moving Average) and state-space models.

- Nonparametric Methods: Statsmodels provides nonparametric statistics methods, which can be useful when the data does not meet the assumptions required by parametric methods.

- Statistical Tests: The module includes a variety of statistical tests for different purposes, such as t-tests, chi-square tests, and ANOVA.

- Plotting Functions: It offers various plotting functions for visualizing results, like autocorrelation plots, partial autocorrelation plots, and qq-plots.

- Datasets: Statsmodels includes several datasets which can be used for examples and testing models.

- Extensive Output of Model Summaries: The summary output of models in statsmodels is comprehensive, showing various statistical metrics that help in interpreting the model's performance.


In [5]:
import statsmodels.api as sm
import pandas as pd

# Example data
df = pd.DataFrame({
    'X': [1, 2, 3, 4, 5],
    'Y': [2, 4, 5, 4, 5]
})

# Ordinary Least Squares (OLS) regression
model = sm.OLS(df['Y'], sm.add_constant(df['X'])).fit()

# View the summary of the model
print(model.summary())

#In this example, an OLS regression model is fitted to the data, 
#and then the summary of the model is printed, 
#which includes various statistical measures and tests to assess the model's validity and performance

                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.600
Model:                            OLS   Adj. R-squared:                  0.467
Method:                 Least Squares   F-statistic:                     4.500
Date:                Thu, 01 Aug 2024   Prob (F-statistic):              0.124
Time:                        13:49:16   Log-Likelihood:                -5.2598
No. Observations:                   5   AIC:                             14.52
Df Residuals:                       3   BIC:                             13.74
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.2000      0.938      2.345      0.1

