

### Descriptive Statistics Part 1

#### What is Statistics?
Statistics is the science of collecting, analyzing, interpreting, and presenting data. It provides tools for understanding data and making informed decisions based on data analysis.

**Examples**:
1. **Analyzing Historical Stock Prices**: A trader collects historical stock prices to determine trends and predict future prices.
2. **Volatility Analysis**: Calculating the volatility of different stocks to assess their risk.
3. **Performance Metrics**: Analyzing the performance metrics of various trading strategies to determine their effectiveness.
4. **Market Sentiment**: Analyzing news articles and social media posts to gauge market sentiment.

**Python Example**:
```python
import pandas as pd

# Collecting historical stock price data
data = pd.read_csv('historical_stock_prices.csv')
print(data.head())
```

#### Types of Statistics
1. **Descriptive Statistics**: Summarizes or describes the characteristics of a dataset.
2. **Inferential Statistics**: Makes inferences about populations based on samples.

**Examples**:
1. **Descriptive**: 
- Calculating the average daily trading volume of a stock over a month.
- Creating a histogram of daily returns for a stock to understand its volatility.
2. **Inferential**: 
- Using sample data from a specific trading period to estimate the performance of a trading algorithm.
- Performing a hypothesis test to determine if a new trading strategy is more effective than the existing one.

**Python Example**:
```python
import pandas as pd
from scipy.stats import ttest_ind

# Descriptive statistics
data = pd.read_csv('historical_stock_prices.csv')
mean_price = data['Close'].mean()
median_price = data['Close'].median()
print(f"Mean Price: {mean_price}, Median Price: {median_price}")

# Inferential statistics (e.g., two-sample t-test on two consecutive 50-row windows)
sample1 = data['Close'][:50]
sample2 = data['Close'][50:100]
t_stat, p_val = ttest_ind(sample1, sample2)
print(f"T-statistic: {t_stat}, P-value: {p_val}")
```

#### Population vs Sample
- **Population**: The entire set of subjects or items that you are interested in studying.
- **Sample**: A subset of the population used to make inferences about the population.

**Examples**:
1. **Population**: 
- All trades made in the stock market in a year.
- All customers of a trading platform.
2. **Sample**: 
- Trades made in the stock market in January.
- Customers who executed trades during a specific week.

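**Three important factors to consider when creating samples**:
1. **Sample size**: The sample should be large enough for estimates to be stable.
2. **Randomness**: Every member of the population should have a chance of being selected.
3. **Representativeness**: The sample should reflect the composition of the population.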
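
The factors above can be illustrated with a minimal sketch of simple random sampling. The DataFrame, its `trade_id` and `pnl` columns, and the 10% sampling fraction are hypothetical and chosen only for illustration; `random_state` is fixed so the draw is reproducible.

```python
import numpy as np
import pandas as pd

# Hypothetical population: 1,000 trades with synthetic profit-and-loss values
rng = np.random.default_rng(seed=42)
population = pd.DataFrame({
    'trade_id': np.arange(1000),
    'pnl': rng.normal(loc=0.5, scale=2.0, size=1000),
})

# Simple random sample: every trade has the same chance of being selected
sample = population.sample(frac=0.10, random_state=42)

print("Sample size:", len(sample))
print("Population mean PnL:", population['pnl'].mean())
print("Sample mean PnL:", sample['pnl'].mean())
```

A random draw like this is usually a safer basis for inference than a single calendar window, because one month can differ systematically from the rest of the year.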


**Python Example**:
```python
import pandas as pd

# Population data (all trades in a year)
data = pd.read_csv('all_trades_year.csv')

# Sample data (trades in January 2023); assumes 'Date' holds ISO-formatted strings (YYYY-MM-DD)
sample_data = data[data['Date'].str.startswith('2023-01')]
print(sample_data.head())
```


### Parameter vs Statistic

In the context of statistics, a **parameter** is a numerical value that describes a characteristic of a population, whereas a **statistic** is a numerical value that describes a characteristic of a sample taken from that population.

#### Parameter
- A parameter is a fixed, often unknown number that describes some characteristic of a population.
- Examples include population mean (μ), population variance (σ²), population proportion (p), etc.

#### Statistic
- A statistic is a numerical value that describes some characteristic of a sample.
- Examples include sample mean (\(\overline{x}\)), sample variance (s²), sample proportion (\(\hat{p}\)), etc.
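
The distinction can be made concrete with a short sketch. The population below is synthetic (randomly generated daily returns), and the sample size of 100 and the number of repeated samples are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical population of 10,000 daily returns (%); synthetic, for illustration only
population = rng.normal(loc=0.05, scale=1.0, size=10_000)

# Parameter: computed from the whole population, so it is a single fixed number
mu = population.mean()

# Statistic: computed from a sample, so it changes from one sample to the next
sample_means = [
    rng.choice(population, size=100, replace=False).mean()
    for _ in range(5)
]

print("Population mean (parameter):", mu)
print("Sample means (statistics):", sample_means)
```

Each sample mean lands near the population mean but none matches it exactly, which is why a statistic is treated as an estimate of the parameter.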

### Inferential Statistics

Inferential statistics involves making predictions or inferences about a population based on a sample of data drawn from that population. It uses various methods to analyze sample data and make generalizations about the larger population.


### Subtopics Under Inferential Statistics 

1. **Hypothesis Testing**
   - **Purpose**: To evaluate the performance or efficacy of trading strategies.
   - **Example**: Testing whether a new trading algorithm has a statistically significant difference in average returns compared to an existing algorithm.

2. **Confidence Intervals**
   - **Purpose**: To estimate the uncertainty of trading metrics such as expected returns, volatility, and drawdowns.
   - **Example**: Estimating the confidence interval for the average yearly return of a trading strategy to gauge its reliability.

3. **Regression Analysis**
   - **Purpose**: To model the relationships between market variables or to predict future price movements based on historical data.
   - **Example**: Using multiple linear regression to predict stock prices based on economic indicators like GDP growth rate, interest rates, and unemployment rates.

4. **Analysis of Variance (ANOVA)**
   - **Purpose**: To compare the performance of multiple trading strategies across different market conditions.
   - **Example**: Determining if there are significant differences in the returns of different sector-based trading strategies during different economic cycles.

5. **Chi-Square Tests**
   - **Purpose**: To analyze categorical data within market research or trading patterns.
   - **Example**: Testing if there is a significant relationship between the day of the week and the frequency of trading anomalies.

6. **Sampling Techniques**
   - **Purpose**: To ensure representative and unbiased samples for back-testing trading strategies.
   - **Example**: Employing stratified random sampling to select a diverse range of historical data periods for robust strategy testing.

7. **Bayesian Statistics**
   - **Purpose**: To update probabilities of trading outcomes as new data becomes available, accommodating for changing market conditions.
   - **Example**: Using Bayesian inference to update the probability of a stock's outperformance based on new quarterly earnings reports.

8. **Time Series Analysis**
   - **Purpose**: To analyze sequential data points collected over time, crucial for financial data analysis.
   - **Example**: Applying ARIMA models to forecast future stock prices or volatility patterns.

9. **Monte Carlo Simulations**
   - **Purpose**: To assess risk and uncertainty in prediction models by simulating a wide range of possible outcomes.
   - **Example**: Using Monte Carlo simulations to estimate the potential drawdown of a trading strategy under various market scenarios.

10. **Causal Inference**
    - **Purpose**: To determine causal relationships rather than mere correlations, important for strategy validation.
    - **Example**: Identifying whether changes in interest rates directly cause shifts in stock market indices.
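
As one illustration of the subtopics above, the following is a minimal sketch of a confidence interval for a mean return. The return values are made up, and the calculation assumes the observations are independent and roughly normally distributed, which real return series often are not.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of daily strategy returns (%), made up for illustration
returns = np.array([0.4, -0.2, 0.1, 0.6, -0.5, 0.3, 0.2, -0.1, 0.5, 0.0])

n = len(returns)
mean = returns.mean()
# Standard error of the mean, using the sample standard deviation (ddof=1)
sem = returns.std(ddof=1) / np.sqrt(n)

# Two-sided 95% interval based on the t-distribution with n - 1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem

print(f"Mean return: {mean:.3f}")
print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```

An interval that includes zero suggests the sample alone does not rule out a true average return of zero.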






#### Types of Data
1. **Quantitative Data**: Numeric data that can be measured.
   - **Discrete**: Countable data, like the number of trades made.
   - **Continuous**: Measurable data, like the price of a stock.
2. **Qualitative Data**: Descriptive data that can be categorized.
   - **Nominal**: Categories without a specific order, like types of stocks.
   - **Ordinal**: Categories with a specific order, like stock ratings.

**Examples**:
1. **Quantitative - Discrete**: Number of trades made per day.
2. **Quantitative - Continuous**: Daily closing price of a stock.
3. **Qualitative - Nominal**: Sectors of stocks (e.g., technology, healthcare).
4. **Qualitative - Ordinal**: Stock ratings (e.g., AAA, AA, A).

**Python Example**:
```python
import pandas as pd

# Quantitative data
data = pd.read_csv('trades_data.csv')
print(data['Volume'].describe())  # Discrete
print(data['Close'].describe())   # Continuous

# Qualitative data
print(data['Sector'].value_counts())  # Nominal
print(data['Rating'].value_counts())  # Ordinal
```
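
The difference between nominal and ordinal data can also be encoded explicitly. The sketch below uses a hypothetical list of ratings and assumes the ordering A < AA < AAA; `pd.CategoricalDtype` with `ordered=True` tells pandas that the categories have a rank.

```python
import pandas as pd

# Hypothetical ratings; the ordering A < AA < AAA is an assumption for illustration
ratings = pd.Series(['AA', 'A', 'AAA', 'A', 'AA', 'AAA', 'A'])
rating_type = pd.CategoricalDtype(categories=['A', 'AA', 'AAA'], ordered=True)
ordinal_ratings = ratings.astype(rating_type)

# Ordered categories support min/max and comparisons, unlike nominal ones
print("Lowest rating:", ordinal_ratings.min())
print("Highest rating:", ordinal_ratings.max())
print("Rated AA or better:", (ordinal_ratings >= 'AA').sum())
```

Without the explicit ordering, pandas would treat the ratings as plain labels and comparisons such as `>=` would raise an error.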

### Measures of Central Tendency


### 1. **Mean (Arithmetic Average)**
   - **Formula**: The mean is calculated by adding up all the numbers in the data set and then dividing by the number of data points:
   \[
   \text{Mean} = \frac{\sum_{i=1}^n x_i}{n}
   \]
   where \( x_i \) are the values in the dataset and \( n \) is the number of values.

### 2. **Median**
   - **Formula**: If the number of observations \( n \) is odd, the median is the middle value. If \( n \) is even, it is the average of the two middle numbers. Mathematically, it's not expressed in a simple formula but determined through the ordered data set.

### 3. **Mode**
   - **Formula**: The mode is the value or values in the data set that appear most frequently. It is more of a counting and comparison operation rather than a formulaic calculation.

### 4. **Weighted Mean**
   - **Formula**: The weighted mean considers the importance (weight) of each value:
   \[
   \text{Weighted Mean} = \frac{\sum_{i=1}^n (w_i \times x_i)}{\sum_{i=1}^n w_i}
   \]
   where \( w_i \) are the weights assigned to each value \( x_i \).

### 5. **Trimmed Mean**
   - **Description**: The trimmed mean is calculated by removing a certain percentage of the smallest and largest values from the data set, and then calculating the mean of the remaining data. This is particularly useful in reducing the effect of outliers or extreme values.
   - **Formula**: If you trim \( p\% \) of data from both ends of an ordered dataset, you remove \( p\% \) of \( n \) observations from both the lower and upper ends. The trimmed mean is then:
   \[
   \text{Trimmed Mean} = \frac{\sum_{i=k+1}^{n-k} x_i}{n-2k}
   \]
   where \( x_i \) are the ordered values, \( n \) is the total number of observations, and \( k \) is the number of observations to trim from each end (\( k = \lfloor \frac{p}{100} \times n \rfloor \), rounded down to a whole number).

### Application in Algorithmic Trading

- **Mean and Weighted Mean**: Useful for calculating average prices, returns, or other financial indicators where all values are relevant or where recent values need more emphasis.
- **Median**: Offers a robust measure of central tendency when data may be skewed by outliers, such as during market spikes.
- **Mode**: Helpful in identifying the most common values, which can indicate typical market behavior.
- **Trimmed Mean**: Effective in scenarios where outliers (due to market anomalies or errors in data collection) might skew the average, providing a more representative measure of central tendency.
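
A common trading use of the weighted mean is a volume-weighted average price. The sketch below uses made-up prices and volumes to show how weighting by volume shifts the average toward the price at which most shares traded.

```python
import numpy as np

# Hypothetical trade prices and traded volumes, made up for illustration
prices = np.array([100.0, 101.0, 102.0])
volumes = np.array([200, 100, 700])

# The plain mean treats every trade equally; the weighted mean weights by volume
simple_mean = prices.mean()
vwap = np.average(prices, weights=volumes)

print("Simple mean price:", simple_mean)
print("Volume-weighted average price:", vwap)
```

Here most of the volume traded at 102, so the weighted average (101.5) sits above the simple mean (101.0).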









### Setup and Data

First, we'll set up a sample dataset using `pandas`:

```python
import pandas as pd

# Sample dataset of daily returns (%) of a stock
data = {'Daily Returns': [2, -1, 3, 4, 2, 2, 100, 1, 0, 2]}
df = pd.DataFrame(data)
```

### 1. **Mean (Arithmetic Average)**

Calculating the mean using pandas:

```python
mean_return = df['Daily Returns'].mean()
print("Mean Daily Return:", mean_return)
```

### 2. **Median**

Calculating the median using pandas:

```python
median_return = df['Daily Returns'].median()
print("Median Daily Return:", median_return)
```

### 3. **Mode**

Calculating the mode using pandas:

```python
mode_return = df['Daily Returns'].mode()
print("Mode of Daily Returns:", mode_return.tolist())
```

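### 4. **Weighted Mean**

Calculating the weighted mean using NumPy, with linearly increasing weights so that later observations count more:
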
```python
import numpy as np

weights = np.linspace(start=1, stop=len(df), num=len(df))
weighted_mean = np.average(df['Daily Returns'], weights=weights)
print("Weighted Mean of Daily Returns:", weighted_mean)
```

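### 5. **Trimmed Mean**

Calculating the trimmed mean with a small helper function that drops a fixed fraction of the sorted values from each end:
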
```python
def trimmed_mean(series, percentage=0.1):
    # Determine the number of elements to cut from each end
    trim_count = int(len(series) * percentage)
    # Sort the series, drop trim_count values from each end, and calculate the mean.
    # Slicing to len - trim_count (rather than -trim_count) also works when trim_count is 0.
    ordered = series.sort_values()
    return ordered.iloc[trim_count:len(ordered) - trim_count].mean()

# Calculating trimmed mean of daily returns
trimmed_return = trimmed_mean(df['Daily Returns'], 0.1)
print("Trimmed Mean of Daily Returns:", trimmed_return)
```
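
As a cross-check, SciPy provides `scipy.stats.trim_mean`, which cuts a given proportion from each end of the sorted data. The sketch below applies it to the same daily returns as the sample dataset and prints the plain mean and median alongside for comparison.

```python
import numpy as np
from scipy import stats

# Same daily returns (%) as in the sample dataset above
returns = np.array([2, -1, 3, 4, 2, 2, 100, 1, 0, 2])

# Cut 10% of the observations from each end before averaging
trimmed = stats.trim_mean(returns, proportiontocut=0.1)

print("Plain mean:", returns.mean())
print("Median:", np.median(returns))
print("10% trimmed mean:", trimmed)
```

The single outlier of 100 pulls the plain mean up to 11.5, while the median and the 10% trimmed mean both stay at 2.0.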






### Measures of Dispersion

**1. Range**
- **Description**: The range provides a simple measure of the overall spread between the smallest and largest values in a dataset.
- **Mathematical Formula**: 
  \[
  \text{Range} = \text{Maximum value} - \text{Minimum value}
  \]

**2. Variance**
- **Description**: Variance measures the average squared deviations from the mean, giving a sense of how spread out the data points are around the mean.
- **Mathematical Formula**: 
  \[
  \text{Variance} = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}
  \]
  where \( x_i \) are the values and \( \bar{x} \) is the sample mean.

**3. Standard Deviation**
- **Description**: Standard deviation is the square root of the variance and provides a measure of the spread of data points around the mean in the same units as the data.
- **Mathematical Formula**: 
  \[
  \text{Standard Deviation} = \sqrt{\text{Variance}}
  \]

**4. Interquartile Range (IQR)**
- **Description**: The interquartile range is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of the data, focusing on the central spread of the dataset and minimizing the effect of outliers.
- **Mathematical Formula**: 
  \[
  \text{IQR} = Q3 - Q1
  \]

**5. Mean Absolute Deviation (MAD)**
- **Description**: Mean absolute deviation is the average of the absolute deviations from the dataset's mean. Because the deviations are not squared, it is less sensitive to outliers than variance.
- **Mathematical Formula**: 
  \[
  \text{MAD} = \frac{\sum_{i=1}^n |x_i - \bar{x}|}{n}
  \]
  where \( x_i \) are the values and \( \bar{x} \) is the sample mean.



```python
import pandas as pd

# Sample dataset of daily returns (%) of a stock
data = {'Daily Returns': [2, -1, 3, 4, 2, 2, 100, 1, 0, 2]}
df = pd.DataFrame(data)

# Calculate Range
range_return = df['Daily Returns'].max() - df['Daily Returns'].min()

# Calculate Variance
variance_return = df['Daily Returns'].var()

# Calculate Standard Deviation
std_dev_return = df['Daily Returns'].std()

# Calculate Interquartile Range
Q1 = df['Daily Returns'].quantile(0.25)
Q3 = df['Daily Returns'].quantile(0.75)
IQR = Q3 - Q1

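# Calculate Mean Absolute Deviation
# (Series.mad() was removed in pandas 2.0, so compute it directly)
mad_return = (df['Daily Returns'] - df['Daily Returns'].mean()).abs().mean()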

# Print all results
print("Range of Daily Returns:", range_return)
print("Variance of Daily Returns:", variance_return)
print("Standard Deviation of Daily Returns:", std_dev_return)
print("Interquartile Range of Daily Returns:", IQR)
print("Mean Absolute Deviation of Daily Returns:", mad_return)
```
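
The variance formula above divides by \( n-1 \) (the sample variance). Library defaults differ on this point, so the sketch below compares the two conventions on the same data: pandas uses `ddof=1` by default, while NumPy uses `ddof=0`.

```python
import numpy as np
import pandas as pd

# Same daily returns (%) as in the sample dataset
returns = pd.Series([2, -1, 3, 4, 2, 2, 100, 1, 0, 2])

sample_var = returns.var()                  # pandas default: ddof=1, divides by n - 1
population_var = returns.var(ddof=0)        # divides by n
numpy_default = np.var(returns.to_numpy())  # NumPy default: ddof=0, divides by n

print("Sample variance (n - 1):", sample_var)
print("Population variance (n):", population_var)
print("NumPy default variance:", numpy_default)
```

When mixing pandas and NumPy in the same analysis, passing `ddof` explicitly avoids silently switching between the two definitions.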




### Coefficient of Variation

The coefficient of variation (CV) is a measure of relative variability that describes the extent of variability in relation to the mean of the population. It is especially useful in comparing the degree of variation from one data series to another, even if the means are drastically different from each other.

### Mathematical Formula

The coefficient of variation is calculated as the ratio of the standard deviation to the mean, expressed as a percentage:

\[
\text{CV} = \left(\frac{\text{Standard Deviation}}{\text{Mean}}\right) \times 100\%
\]

This formula gives a normalized measure of dispersion, making it easier to compare variability across datasets with different units or scales.

### Python Example



```python
import pandas as pd

# Sample dataset of daily returns (%) of a stock
data = {'Daily Returns': [2, -1, 3, 4, 2, 2, 100, 1, 0, 2]}
df = pd.DataFrame(data)

# Calculate Mean
mean_return = df['Daily Returns'].mean()

# Calculate Standard Deviation
std_dev_return = df['Daily Returns'].std()

# Calculate Coefficient of Variation
cv_return = (std_dev_return / mean_return) * 100

# Print Coefficient of Variation
print("Coefficient of Variation of Daily Returns (%):", cv_return)
```

### Explanation

- **Mean**: The average of the data points.
- **Standard Deviation**: Measures the amount of variation or dispersion in the data set.
- **Coefficient of Variation (CV)**: Provides a standardized measure of dispersion. If the CV is high, it indicates a higher level of dispersion around the mean. Conversely, a lower CV indicates less dispersion.

### Application in Algorithmic Trading

In algorithmic trading, the coefficient of variation can be used to:
- **Risk Assessment**: Compare the risk of different trading strategies or assets through their coefficients of variation. A higher CV means more variability per unit of average return, which may signal a riskier asset.
- **Performance Evaluation**: Compare the performance of models or funds that may operate across different scales of returns.

By using CV, traders can make more informed decisions by understanding not just the variability of returns, but how that variability compares to the average returns, thereby adding an additional layer of risk management to their strategies.
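
The scale independence of the CV can be seen in a small sketch. The two return series below are hypothetical; Asset B is simply Asset A multiplied by ten, so its standard deviation is ten times larger while its CV is identical.

```python
import pandas as pd

# Hypothetical monthly returns (%) for two assets, made up for illustration
returns = pd.DataFrame({
    'Asset A': [1.0, 2.0, 3.0, 2.0, 2.0],
    'Asset B': [10.0, 20.0, 30.0, 20.0, 20.0],
})

# CV = sample standard deviation / mean, expressed as a percentage
std = returns.std()
cv = std / returns.mean() * 100

print("Standard deviations:")
print(std)
print("Coefficients of variation (%):")
print(cv)
```

One caveat: the ratio becomes unstable as the mean approaches zero and is hard to interpret when the mean is negative, both of which are common for short-horizon returns.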










### Graphs for Univariate Analysis

#### 1. **Frequency Distribution Table & Histogram**
   - **Mathematical Description**: A histogram displays the distribution of data by forming bins along the range of the data and then drawing bars to show the number of observations that fall in each bin.
   - **Python Code**:
     ```python
     import matplotlib.pyplot as plt
     import pandas as pd
     
     # Sample data
     data = {'Daily Returns': [2, -1, 3, 4, 2, 2, 100, 1, 0, 2]}
     df = pd.DataFrame(data)

     # Plotting Histogram
     plt.hist(df['Daily Returns'], bins=10, color='blue', alpha=0.7)
     plt.title('Histogram of Daily Returns')
     plt.xlabel('Daily Returns (%)')
     plt.ylabel('Frequency')
     plt.show()
     ```
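
The frequency distribution table behind a histogram can also be built directly. In the sketch below the bin edges are hypothetical, chosen so that the outlier falls into its own bin; `pd.cut` assigns each value to a right-closed interval and `value_counts` tallies them.

```python
import pandas as pd

# Same daily returns (%) as in the histogram example
returns = pd.Series([2, -1, 3, 4, 2, 2, 100, 1, 0, 2], name='Daily Returns')

# Hypothetical bin edges; intervals are right-closed: (-2, 0], (0, 2], (2, 4], (4, 100]
bins = [-2, 0, 2, 4, 100]
binned = pd.cut(returns, bins=bins)

# Count the observations in each bin, keeping the bins in ascending order
frequency_table = binned.value_counts().sort_index()
print(frequency_table)
```

Unequal bin widths like these make the table easier to read when a dataset has one extreme value, although a histogram drawn from them would need density scaling to stay comparable across bins.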



#### 2. **Cumulative Frequency**
- **Mathematical Description**: Cumulative frequency is used to determine the number of observations below a particular value in a dataset. It's calculated by successively adding each frequency from a frequency distribution table to the sum of its predecessors.
  
- **Python Code**:
  ```python
  import matplotlib.pyplot as plt
  import numpy as np
  import pandas as pd

  # Sample data
  data = {'Daily Returns': [2, -1, 3, 4, 2, 2, 100, 1, 0, 2]}
  df = pd.DataFrame(data)

  # Creating the frequency distribution for the 'Daily Returns'
  frequency, bins = np.histogram(df['Daily Returns'], bins=10, range=[df['Daily Returns'].min(), df['Daily Returns'].max()])

  # Calculating the cumulative frequency
  cumulative_frequency = np.cumsum(frequency)

  # Plotting the cumulative frequency graph
  plt.plot(bins[1:], cumulative_frequency, marker='o', linestyle='-', color='b')
  plt.title('Cumulative Frequency of Daily Returns')
  plt.xlabel('Daily Returns (%)')
  plt.ylabel('Cumulative Frequency')
  plt.grid(True)
  plt.show()
  ```





### Graphs for Bivariate Analysis

#### 1. **Categorical - Categorical: Contingency Table/Crosstab**
   - **Mathematical Description**: A contingency table (or crosstab) summarizes the relationship between two categorical variables by showing the counts of intersections.
   - **Python Code**:
     ```python
     import pandas as pd
     import matplotlib.pyplot as plt
     import seaborn as sns
     
     # Sample data
     data = {'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'C'],
             'Outcome': ['Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'No']}
     df = pd.DataFrame(data)

     # Creating a crosstab
     contingency_table = pd.crosstab(df['Category'], df['Outcome'])
     print(contingency_table)

     # Plotting a heatmap of the crosstab
     sns.heatmap(contingency_table, annot=True, cmap="YlGnBu")
     plt.title('Contingency Table of Category vs. Outcome')
     plt.show()
     ```
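
Raw counts can be hard to compare when the categories have different totals. As a small sketch on the same sample data, `pd.crosstab` accepts `normalize='index'` to convert each row into proportions.

```python
import pandas as pd

# Same sample data as in the contingency table example
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'C'],
        'Outcome': ['Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'Yes', 'No']}
df = pd.DataFrame(data)

# Row-wise proportions: share of each outcome within each category
row_proportions = pd.crosstab(df['Category'], df['Outcome'], normalize='index')
print(row_proportions)
```

Each row now sums to 1, so the share of "Yes" outcomes can be compared across categories directly.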

#### 2. **Numerical - Numerical: Scatter Plot**
   - **Mathematical Description**: A scatter plot maps individual data points for two numeric variables along two axes, providing a visual examination of the relationships or patterns in the data.
   - **Python Code**:
     ```python
     import matplotlib.pyplot as plt
     import pandas as pd

     # Sample data
     data = {'Variable1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
             'Variable2': [2, 3, 2, 5, 7, 8, 9, 10, 12, 11]}
     df = pd.DataFrame(data)

     # Plotting Scatter Plot
     plt.scatter(df['Variable1'], df['Variable2'])
     plt.title('Scatter Plot of Variable1 vs Variable2')
     plt.xlabel('Variable1')
     plt.ylabel('Variable2')
     plt.grid(True)
     plt.show()
     ```
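
A scatter plot shows the direction of a relationship, and a correlation coefficient puts a number on its strength. The sketch below computes the Pearson correlation for the same two variables; it measures linear association only.

```python
import numpy as np
import pandas as pd

# Same sample data as in the scatter plot example
df = pd.DataFrame({'Variable1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'Variable2': [2, 3, 2, 5, 7, 8, 9, 10, 12, 11]})

# Pearson correlation: +1 is a perfect increasing line, -1 a perfect decreasing line
r = df['Variable1'].corr(df['Variable2'])

# Cross-check with NumPy's correlation matrix
r_numpy = np.corrcoef(df['Variable1'], df['Variable2'])[0, 1]

print(f"Pearson correlation: {r:.3f}")
```

A value close to 1 here matches the upward trend visible in the plot.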

#### 3. **Categorical - Numerical**
   - **Mathematical Description**: This type of plot is used to visualize the relationship between a categorical variable and a numerical variable, often using box plots or bar charts to show distribution or average values respectively.
   - **Python Code**:
     ```python
     import pandas as pd
     import seaborn as sns
     import matplotlib.pyplot as plt

     # Sample data
     data = {'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
             'Value': [10, 20, 10, 30, 20, 40]}
     df = pd.DataFrame(data)

     # Plotting bar chart for categorical - numerical relationship
     sns.barplot(x='Category', y='Value', data=df)
     plt.title('Bar Chart of Categories and Values')
     plt.xlabel('Category')
     plt.ylabel('Value')
     plt.show()
     ```
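
The numbers behind such a chart can be inspected with a grouped summary. The sketch below uses the same sample data and computes per-category statistics with `groupby`; the choice of statistics is illustrative.

```python
import pandas as pd

# Same sample data as in the bar chart example
df = pd.DataFrame({'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'Value': [10, 20, 10, 30, 20, 40]})

# Per-category summary: the mean corresponds to the bar heights,
# while min and max show the spread that a box plot would display
summary = df.groupby('Category')['Value'].agg(['mean', 'median', 'min', 'max'])
print(summary)
```

Categories A and B share the same minimum but differ in mean, a distinction that a bar chart of averages alone would only partly reveal.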










