# Statistics and Data Analytics for Data Science

In this notebook, we explore fundamental statistical concepts and techniques that are essential for data science and exploratory data analysis (EDA). Understanding these concepts is crucial when exploring the data that will later be used to build a predictive model.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set style for plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Create sample data that mimics Lending Club Dataset
np.random.seed(42)
n_samples = 5000

data = {
    'loan_amnt': np.random.normal(15000, 10000, n_samples),
    'int_rate': np.random.normal(12, 4, n_samples),
    'annual_inc': np.random.normal(75000, 30000, n_samples),
    'dti': np.random.normal(15, 10, n_samples),
    'fico_score': np.random.normal(700, 50, n_samples),
    'emp_length': np.random.gamma(2, 2, n_samples),
    'loan_status': np.random.choice([0, 1], n_samples, p=[0.8, 0.2])  # 0: Fully Paid, 1: Charged Off
}

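# Ensure no negative values (np.abs reflects negative draws back to positive)
# and keep bounded columns within their valid ranges
data['loan_amnt'] = np.abs(data['loan_amnt'])
data['int_rate'] = np.abs(data['int_rate'])
data['annual_inc'] = np.abs(data['annual_inc'])
data['dti'] = np.abs(data['dti'])
data['fico_score'] = np.clip(data['fico_score'], 300, 850)
data['emp_length'] = np.clip(data['emp_length'], 0, 15)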

# Create DataFrame
df = pd.DataFrame(data)

print("Sample Lending Club Dataset")
print(df.head())
print(f"\nDataset Shape: {df.shape}")
print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None

## Introduction to Descriptive Statistics

Descriptive statistics summarize and describe the main features of a dataset. They provide simple summaries of the sample and its measurements and, together with simple graphical analysis, form the basis of virtually every quantitative analysis of data.

In [None]:
# Basic descriptive statistics
print("Descriptive Statistics for the Lending Club Dataset")
print(df.describe())

# Additional statistics
for col in df.select_dtypes(include=[np.number]).columns:
    print(f"\n{col}:")
    print(f"  Mean: {df[col].mean():.2f}")
    print(f"  Median: {df[col].median():.2f}")
    print(f"  Mode: {df[col].mode()[0]:.2f}")
    print(f"  Std Dev: {df[col].std():.2f}")
    print(f"  Variance: {df[col].var():.2f}")
    print(f"  Skewness: {df[col].skew():.2f}")
    print(f"  Kurtosis: {df[col].kurtosis():.2f}")

## Measures of Central Tendency

Measures of central tendency are statistical measures that represent the center or typical value of a dataset. The three main measures are:

1. **Mean**: The arithmetic average of the data
2. **Median**: The middle value when data is arranged in order
3. **Mode**: The most frequently occurring value in the dataset
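
To see how the three measures react to an extreme value, here is a minimal sketch using Python's built-in `statistics` module. The income figures are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import statistics

# Hypothetical annual incomes (in dollars); the last value is a deliberate outlier.
incomes = [42000, 48000, 48000, 51000, 55000, 60000, 250000]

mean_income = statistics.mean(incomes)      # sum of values / number of values
median_income = statistics.median(incomes)  # middle value of the sorted list
mode_income = statistics.mode(incomes)      # most frequently occurring value

print(f"Mean:   {mean_income:.2f}")
print(f"Median: {median_income:.2f}")
print(f"Mode:   {mode_income:.2f}")
```

The single large income pulls the mean well above the median, which is why the median is often preferred for skewed variables. For continuous columns, where nearly every value is unique, the mode carries little information.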

In [None]:
# Calculating measures of central tendency for key variables
print("Measures of Central Tendency:")
for col in ['loan_amnt', 'int_rate', 'annual_inc', 'dti', 'fico_score']:
    mean_val = df[col].mean()
    median_val = df[col].median()
    # For continuous data almost every value is unique, so the mode is rarely
    # informative: mode() returns all tied values sorted, and [0] is the smallest
    mode_val = df[col].mode()[0]
    print(f"\n{col}:")
    print(f"  Mean: {mean_val:.2f}")
    print(f"  Median: {median_val:.2f}")
    print(f"  Mode: {mode_val:.2f}")

# Visualizing central tendency
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

for i, col in enumerate(['loan_amnt', 'int_rate', 'annual_inc', 'dti', 'fico_score']):
    axes[i].hist(df[col], bins=50, alpha=0.7, color='lightblue', edgecolor='black')
    
    mean_val = df[col].mean()
    median_val = df[col].median()
    mode_val = df[col].mode()[0]
    
    axes[i].axvline(mean_val, color='red', linestyle='--', label=f'Mean: {mean_val:.2f}')
    axes[i].axvline(median_val, color='green', linestyle='--', label=f'Median: {median_val:.2f}')
    axes[i].axvline(mode_val, color='orange', linestyle='--', label=f'Mode: {mode_val:.2f}')
    
    axes[i].set_title(f'{col} Distribution')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)

# Remove empty subplot
axes[-1].set_visible(False)

plt.tight_layout()
plt.show()

## Measures of Dispersion

Measures of dispersion describe the spread or variability of the data. Key measures include:

1. **Range**: Difference between maximum and minimum values
2. **Variance**: Average of squared deviations from the mean (pandas' `var()` divides by n - 1 by default, giving the sample variance)
3. **Standard Deviation**: Square root of variance
4. **Interquartile Range (IQR)**: Difference between 75th and 25th percentiles
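
The measures above can be worked through by hand on a small sample. The sketch below uses plain Python and hypothetical interest rates; it divides by n - 1 for the variance, matching the pandas default, and also computes the coefficient of variation (standard deviation relative to the mean).

```python
import math
import statistics

# Hypothetical interest rates (%) for eight loans, already sorted.
rates = [7.5, 9.0, 10.5, 11.0, 12.5, 13.0, 15.5, 21.0]

n = len(rates)
mean_rate = sum(rates) / n

# Range: largest value minus smallest value.
value_range = max(rates) - min(rates)

# Sample variance: squared deviations from the mean, summed and divided by n - 1.
sample_var = sum((x - mean_rate) ** 2 for x in rates) / (n - 1)
sample_std = math.sqrt(sample_var)

# Quartiles using linear interpolation between data points (method="inclusive").
q1, _, q3 = statistics.quantiles(rates, n=4, method="inclusive")
iqr = q3 - q1

# Coefficient of variation: standard deviation as a percentage of the mean.
cv = sample_std / mean_rate * 100

print(f"Range:    {value_range:.2f}")
print(f"Variance: {sample_var:.2f}")
print(f"Std Dev:  {sample_std:.2f}")
print(f"IQR:      {iqr:.2f}")
print(f"CV:       {cv:.2f}%")
```

The range is driven entirely by the two extreme values, while the IQR only looks at the middle half of the data, so it is far less sensitive to the 21.0 reading.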

In [None]:
# Calculating measures of dispersion
print("Measures of Dispersion:")
for col in ['loan_amnt', 'int_rate', 'annual_inc', 'dti', 'fico_score']:
    range_val = df[col].max() - df[col].min()
    variance = df[col].var()
    std_dev = df[col].std()
    iqr = df[col].quantile(0.75) - df[col].quantile(0.25)
    cv = (df[col].std() / df[col].mean()) * 100  # Coefficient of variation
    
    print(f"\n{col}:")
    print(f"  Range: {range_val:.2f}")
    print(f"  Variance: {variance:.2f}")
    print(f"  Standard Deviation: {std_dev:.2f}")
    print(f"  IQR: {iqr:.2f}")
    print(f"  Coefficient of Variation: {cv:.2f}%")

# Visualizing dispersion with violin plots
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for i, col in enumerate(['loan_amnt', 'int_rate', 'annual_inc', 'dti', 'fico_score']):
    sns.violinplot(y=df[col], ax=axes[i])
    axes[i].set_title(f'{col} - Violin Plot')
    axes[i].grid(True, alpha=0.3)

# Remove empty subplot
axes[-1].set_visible(False)

plt.tight_layout()
plt.show()

## Measures of Asymmetry

Skewness measures the asymmetry of the distribution of values. A distribution is said to be:

- **Symmetrical** (Skewness ≈ 0): The distribution is balanced on both sides of the center
- **Positively Skewed** (Skewness > 0): The tail extends to the right
- **Negatively Skewed** (Skewness < 0): The tail extends to the left

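Kurtosis measures the 'tailedness' of the distribution. A normal distribution has a kurtosis of 3; pandas' `kurtosis()` reports *excess* kurtosis (kurtosis minus 3), so the values printed below are compared against 0:

- **Mesokurtic** (Kurtosis ≈ 3, excess ≈ 0): Tails similar to a normal distribution
- **Leptokurtic** (Kurtosis > 3, excess > 0): Heavy tails (more outliers)
- **Platykurtic** (Kurtosis < 3, excess < 0): Light tails (fewer outliers)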
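
Both quantities can be computed from central moments. The sketch below is a plain-Python illustration on hypothetical values; the helper name `shape_moments` is made up for this example. It subtracts 3 from the kurtosis to give excess kurtosis, the quantity pandas' `kurtosis()` returns. pandas additionally applies a small-sample bias correction, so its results on short lists will differ somewhat from these raw-moment values.

```python
def shape_moments(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # a normal distribution scores 0
    return skewness, excess_kurtosis

# Hypothetical samples: one evenly spread, one with a long right tail.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_tailed = [1, 1, 2, 2, 3, 3, 4, 5, 30]

for name, sample in [("symmetric", symmetric), ("right_tailed", right_tailed)]:
    skew, kurt = shape_moments(sample)
    print(f"{name}: skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

The evenly spread sample has zero skewness and negative excess kurtosis (light tails), while the single large value in the second sample produces positive skewness and positive excess kurtosis.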

In [None]:
# Analyzing skewness and kurtosis
print("Measure of Asymmetry (Skewness and Kurtosis):")
for col in ['loan_amnt', 'int_rate', 'annual_inc', 'dti', 'fico_score']:
    skewness = df[col].skew()
    kurtosis = df[col].kurtosis()
    
    print(f"\n{col}:")
    print(f"  Skewness: {skewness:.3f}")
    if abs(skewness) < 0.5:
        print("  Distribution: Approximately symmetric")
    elif skewness > 0:
        print("  Distribution: Positively skewed (right tail)")
    else:
        print("  Distribution: Negatively skewed (left tail)")
    
    # pandas reports excess kurtosis (normal = 0). A sample value is never
    # exactly 0, so a band of +/-0.5 (an illustrative cutoff) counts as normal.
    print(f"  Excess kurtosis: {kurtosis:.3f}")
    if kurtosis > 0.5:
        print("  Distribution: Heavy tails (leptokurtic)")
    elif kurtosis < -0.5:
        print("  Distribution: Light tails (platykurtic)")
    else:
        print("  Distribution: Approximately normal tails (mesokurtic)")

# Plotting histograms with skewness and kurtosis information
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

for i, col in enumerate(['loan_amnt', 'int_rate', 'annual_inc', 'dti', 'fico_score']):
    axes[i].hist(df[col], bins=50, alpha=0.7, color='lightblue', edgecolor='black', density=True)
    
    skewness = df[col].skew()
    kurtosis = df[col].kurtosis()
    
    axes[i].set_title(f'{col} Distribution\nSkewness: {skewness:.3f}, Kurtosis: {kurtosis:.3f}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Density')
    axes[i].grid(True, alpha=0.3)

# Remove empty subplot
axes[-1].set_visible(False)

plt.tight_layout()
plt.show()

## Univariate Analysis

Univariate analysis involves examining one variable at a time to understand its distribution, central tendency, and dispersion. This is the foundation of statistical analysis and helps identify patterns, outliers, and the shape of data distribution.
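
One common way to flag the outliers mentioned above is the 1.5 × IQR rule (Tukey's fences). The sketch below is an illustration on hypothetical debt-to-income values, not a step taken in this notebook's cells, and the 1.5 multiplier is a convention rather than a hard threshold.

```python
import statistics

# Hypothetical debt-to-income ratios, sorted; 58.0 is an obviously unusual value.
dti_values = [8.0, 11.5, 12.0, 14.0, 15.5, 16.0, 18.5, 21.0, 58.0]

# First and third quartiles with linear interpolation between data points.
q1, _, q3 = statistics.quantiles(dti_values, n=4, method="inclusive")
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in dti_values if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Flagged values: {outliers}")
```

Flagged values are candidates for a closer look, not automatic deletions; whether they are errors or genuine extremes depends on the variable.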

In [None]:
# Univariate Analysis
print("Univariate Analysis Results:")

# Summary statistics for numerical variables
numerical_cols = ['loan_amnt', 'int_rate', 'annual_inc', 'dti', 'fico_score', 'emp_length']

for col in numerical_cols:
    print(f"\n{col.upper()}")
    print(f"  Mean: {df[col].mean():.2f}")
    print(f"  Median: {df[col].median():.2f}")
    print(f"  Std: {df[col].std():.2f}")
    print(f"  Min: {df[col].min():.2f}")
    print(f"  Max: {df[col].max():.2f}")
    print(f"  25%: {df[col].quantile(0.25):.2f}")
    print(f"  75%: {df[col].quantile(0.75):.2f}")
    print(f"  Skewness: {df[col].skew():.2f}")
    print(f"  Kurtosis: {df[col].kurtosis():.2f}")
    
# Visualizations for univariate analysis
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

for i, col in enumerate(numerical_cols):
    axes[i].hist(df[col], bins=30, color='lightblue', edgecolor='black', alpha=0.7)
    axes[i].set_title(f'{col} Distribution')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Bivariate Analysis

Bivariate analysis examines the relationship between two variables. It helps understand correlations, associations, and dependencies between variables. Techniques include:

1. Correlation analysis
2. Scatter plots
3. Cross-tabulation for categorical variables
4. Box plots comparing categories
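
Correlation analysis usually starts with the Pearson coefficient, which divides the covariance of two variables by the product of their spreads. The sketch below computes it by hand in plain Python. The FICO scores and rates are hypothetical and deliberately constructed to move in opposite directions, and `pearson_r` is a helper written for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to each variance).
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical borrowers: higher FICO score, lower interest rate.
fico = [620, 660, 700, 740, 780]
rate = [18.0, 15.5, 12.0, 10.5, 8.0]

r = pearson_r(fico, rate)
print(f"Pearson r = {r:.4f}")
```

A value close to -1 indicates a strong negative linear relationship. In this notebook's synthetic dataset each column is drawn independently, so the correlations in the heatmap should all sit near zero.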

In [None]:
# Bivariate Analysis
print("Bivariate Analysis Results:")

# Correlation matrix
corr_cols = ['loan_amnt', 'int_rate', 'annual_inc', 'dti', 'fico_score', 'loan_status']
corr_matrix = df[corr_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, square=True, fmt='.3f')
plt.title('Correlation Matrix - Numerical Variables')
plt.show()

# Strongest correlations
corrs = corr_matrix.unstack().sort_values(key=abs, ascending=False)
# The matrix is symmetric: keep each pair once and drop self-correlations
first = corrs.index.get_level_values(0)
second = corrs.index.get_level_values(1)
corrs = corrs[first < second]
print("\nStrongest Correlations:")
print(corrs.head(10))

# Scatter plots for key relationships
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

# 1. FICO Score vs Interest Rate
axes[0].scatter(df['fico_score'], df['int_rate'], alpha=0.5, color='blue')
axes[0].set_xlabel('FICO Score')
axes[0].set_ylabel('Interest Rate (%)')
axes[0].set_title('FICO Score vs Interest Rate')
axes[0].grid(True, alpha=0.3)

# 2. Annual Income vs Loan Amount
axes[1].scatter(df['annual_inc'], df['loan_amnt'], alpha=0.5, color='green')
axes[1].set_xlabel('Annual Income')
axes[1].set_ylabel('Loan Amount')
axes[1].set_title('Annual Income vs Loan Amount')
axes[1].grid(True, alpha=0.3)

# 3. DTI vs Interest Rate
axes[2].scatter(df['dti'], df['int_rate'], alpha=0.5, color='red')
axes[2].set_xlabel('Debt-to-Income Ratio')
axes[2].set_ylabel('Interest Rate (%)')
axes[2].set_title('DTI vs Interest Rate')
axes[2].grid(True, alpha=0.3)

# 4. FICO Score by Loan Status
sns.boxplot(data=df, x='loan_status', y='fico_score', ax=axes[3])
axes[3].set_title('FICO Score by Loan Status')
axes[3].grid(True, alpha=0.3)

# 5. Interest Rate by Loan Status
sns.boxplot(data=df, x='loan_status', y='int_rate', ax=axes[4])
axes[4].set_title('Interest Rate by Loan Status')
axes[4].grid(True, alpha=0.3)

# 6. Loan Amount by Loan Status
sns.boxplot(data=df, x='loan_status', y='loan_amnt', ax=axes[5])
axes[5].set_title('Loan Amount by Loan Status')
axes[5].set_xlabel('Loan Status')
axes[5].set_ylabel('Loan Amount')
axes[5].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
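
Cross-tabulation, listed among the techniques for this section, counts how often each combination of two categorical values occurs. Here is a minimal sketch in plain Python, assuming a hypothetical `fico_band` grouping that is not a column in the notebook's DataFrame; the records are made up for illustration.

```python
from collections import Counter

# Hypothetical (fico_band, loan_status) pairs; 0 = Fully Paid, 1 = Charged Off,
# the same encoding this notebook uses for loan_status.
records = [
    ("low", 1), ("low", 0), ("low", 1), ("low", 0),
    ("mid", 0), ("mid", 0), ("mid", 1), ("mid", 0),
    ("high", 0), ("high", 0), ("high", 0), ("high", 0),
]

counts = Counter(records)  # number of occurrences of each (band, status) pair
bands = ["low", "mid", "high"]
statuses = [0, 1]

# One row per band, one column per status.
table = {band: [counts[(band, s)] for s in statuses] for band in bands}

print(f"{'band':<6}{'paid':>6}{'charged_off':>13}{'rate':>8}")
for band in bands:
    paid, charged_off = table[band]
    charge_off_rate = charged_off / (paid + charged_off)
    print(f"{band:<6}{paid:>6}{charged_off:>13}{charge_off_rate:>8.2f}")
```

With pandas, `pd.crosstab` builds the same kind of table from two columns, and `pd.cut` can bin a numeric column such as `fico_score` into bands first.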

## Conclusion

In this notebook, we've explored various fundamental concepts in statistics and data analytics that are crucial for data science:

1. **Descriptive Statistics**: We calculated and visualized measures of central tendency (mean, median, mode) and dispersion (variance, standard deviation, IQR) for our lending dataset.

2. **Measures of Central Tendency**: We computed the mean, median, and mode of each key variable and marked them on its histogram to compare them against the shape of the distribution.

3. **Measures of Dispersion**: We calculated and visualized range, variance, standard deviation, and IQR.

4. **Measures of Asymmetry**: We analyzed skewness and kurtosis for each variable.

5. **Univariate Analysis**: We analyzed each variable individually to understand its distribution.

6. **Bivariate Analysis**: We examined relationships between pairs of variables using correlation and visualization.

These statistical techniques form the foundation for exploratory data analysis (EDA) and are essential for creating predictive models in data science. They help ensure that models are built on solid statistical understanding of the underlying data.