# Exploratory Data Analysis (EDA) - Red Wine Quality Dataset

## Introduction

**Exploratory Data Analysis (EDA)** is a critical first step in any data science project. It involves analyzing datasets to summarize their main characteristics, often using statistical graphics and visualization methods. EDA helps us understand the data, discover patterns, spot anomalies, test hypotheses, and check assumptions.

In this notebook, we'll perform a comprehensive EDA on the **Red Wine Quality dataset**, which contains physicochemical properties and quality ratings of red wine samples.

### What You'll Learn

1. **Data Loading and Initial Exploration**
2. **Understanding Data Structure and Types**
3. **Statistical Summary and Distribution Analysis**
4. **Missing Value Detection and Handling**
5. **Univariate Analysis** (analyzing individual variables)
6. **Bivariate Analysis** (analyzing relationships between variables)
7. **Multivariate Analysis** (analyzing multiple variables together)
8. **Correlation Analysis**
9. **Outlier Detection**
10. **Data Visualization Techniques**

### Why EDA is Important

- **Understand Your Data**: Know what you're working with
- **Data Quality Check**: Identify missing values, duplicates, and errors
- **Feature Understanding**: Learn which features are important
- **Pattern Discovery**: Find relationships and trends
- **Hypothesis Generation**: Form ideas for modeling
- **Communication**: Create visualizations to share insights

## 1. Import Necessary Libraries

Before we begin, let's import all the libraries we'll need for our analysis.

In [None]:
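# Data manipulation and analysis
import pandas as pd
import numpy as np

# Rich DataFrame rendering (available by default in Jupyter; imported explicitly
# so that the display() calls below also work outside a notebook)
from IPython.display import display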

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical analysis
from scipy import stats
from scipy.stats import skew, kurtosis

# Ignore warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

print("Libraries imported successfully!")
print("=" * 60)
print("Pandas version:", pd.__version__)
print("NumPy version:", np.__version__)
print("Matplotlib version:", plt.matplotlib.__version__)
print("Seaborn version:", sns.__version__)
print("=" * 60)

## 2. Load and Initial Data Exploration

### About the Red Wine Quality Dataset

The Red Wine Quality dataset contains information about Portuguese "Vinho Verde" red wine samples. Each row represents a wine sample with various physicochemical properties and a quality rating.

**Features:**
1. **fixed acidity**: Non-volatile acids (tartaric acid) in g/dm³
2. **volatile acidity**: Volatile acids (acetic acid) that can cause unpleasant vinegar taste in g/dm³
3. **citric acid**: Adds freshness and flavor to wines in g/dm³
4. **residual sugar**: Sugar remaining after fermentation in g/dm³
5. **chlorides**: Amount of salt in the wine in g/dm³
6. **free sulfur dioxide**: Free form of SO₂ that prevents microbial growth in mg/dm³
7. **total sulfur dioxide**: Total SO₂ (free + bound forms) in mg/dm³
8. **density**: Density of wine in g/cm³
9. **pH**: Acidity level (0-14 scale, most wines are 3-4)
10. **sulphates**: Wine additive that contributes to SO₂ levels in g/dm³
11. **alcohol**: Alcohol content in % vol

**Target Variable:**
- **quality**: Quality rating from 0 (very bad) to 10 (excellent) - scored by wine experts
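
The next cell builds a synthetic stand-in rather than reading the real file. To analyze the real data instead, something along the lines of the sketch below could be used. It assumes the UCI file has been downloaded locally as `winequality-red.csv` and is semicolon-delimited; the helper name `load_wine_csv` and the two-row in-memory sample are illustrative only.

```python
import io

import pandas as pd


def load_wine_csv(source):
    """Read a semicolon-delimited wine-quality CSV from a path or file-like object."""
    return pd.read_csv(source, sep=";")


# Hypothetical two-row sample in the same layout, so the sketch runs offline.
sample_csv = io.StringIO(
    '"fixed acidity";"volatile acidity";"pH";"alcohol";"quality"\n'
    "7.4;0.7;3.51;9.4;5\n"
    "7.8;0.88;3.2;9.8;5\n"
)
wine_sample = load_wine_csv(sample_csv)
print(wine_sample.shape)
print(wine_sample.dtypes)

# With the downloaded file (assumed file name):
# df = load_wine_csv("winequality-red.csv")
```

Passing `sep=";"` explicitly matters here: with the default comma separator, pandas would read each row as a single column.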

In [None]:
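# Load the dataset
# Note: The real data can be downloaded from https://archive.ics.uci.edu/ml/datasets/wine+quality
# For this example, we generate a synthetic stand-in with the same columns and row count.
# Each feature is drawn independently, so relationships between columns
# (including correlations with quality) will not reflect the real dataset.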

# Create sample data for demonstration
np.random.seed(42)
n_samples = 1599

# Generate sample wine data with realistic distributions
data = {
    'fixed acidity': np.random.normal(8.32, 1.74, n_samples),
    'volatile acidity': np.random.gamma(2, 0.26, n_samples),
    'citric acid': np.random.beta(2, 2, n_samples) * 0.8,
    'residual sugar': np.random.exponential(2.5, n_samples),
    'chlorides': np.random.gamma(2, 0.044, n_samples),
    'free sulfur dioxide': np.random.gamma(2.5, 6.2, n_samples),
    'total sulfur dioxide': np.random.gamma(3, 15, n_samples),
    'density': np.random.normal(0.9967, 0.0019, n_samples),
    'pH': np.random.normal(3.31, 0.15, n_samples),
    'sulphates': np.random.gamma(3, 0.22, n_samples),
    'alcohol': np.random.normal(10.42, 1.08, n_samples),
}

# Generate quality scores (3-8, with most around 5-6)
quality_probs = [0.01, 0.03, 0.43, 0.40, 0.12, 0.01]
data['quality'] = np.random.choice(range(3, 9), n_samples, p=quality_probs)

# Create DataFrame
df = pd.DataFrame(data)

# Ensure realistic bounds
df['fixed acidity'] = df['fixed acidity'].clip(4, 16)
df['volatile acidity'] = df['volatile acidity'].clip(0.12, 1.58)
df['citric acid'] = df['citric acid'].clip(0, 1)
df['residual sugar'] = df['residual sugar'].clip(0.9, 15.5)
df['chlorides'] = df['chlorides'].clip(0.012, 0.611)
df['free sulfur dioxide'] = df['free sulfur dioxide'].clip(1, 72)
df['total sulfur dioxide'] = df['total sulfur dioxide'].clip(6, 289)
df['density'] = df['density'].clip(0.99007, 1.00369)
df['pH'] = df['pH'].clip(2.74, 4.01)
df['sulphates'] = df['sulphates'].clip(0.33, 2.0)
df['alcohol'] = df['alcohol'].clip(8.4, 14.9)

print("Synthetic sample dataset created successfully!")
print("=" * 70)

### 2.1 First Look at the Data

The `.head()` and `.tail()` methods show the first and last few rows of the dataset, giving us an initial glimpse of what the data looks like.

In [None]:
# Display first few rows
print("First 10 rows of the dataset:")
print("=" * 70)
display(df.head(10))

print("\n" + "=" * 70)
print("Last 5 rows of the dataset:")
print("=" * 70)
display(df.tail())

### 2.2 Dataset Shape and Structure

Understanding the size and structure of your dataset is crucial for planning your analysis.

In [None]:
# Dataset shape
print("=" * 70)
print("DATASET SHAPE AND STRUCTURE")
print("=" * 70)
print(f"Number of rows (samples): {df.shape[0]:,}")
print(f"Number of columns (features): {df.shape[1]}")
print(f"Total data points: {df.shape[0] * df.shape[1]:,}")

print("\n" + "=" * 70)
print("COLUMN NAMES:")
print("=" * 70)
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

print("\n" + "=" * 70)
print("DATA TYPES:")
print("=" * 70)
print(df.dtypes)

print("\n" + "=" * 70)
print("MEMORY USAGE:")
print("=" * 70)
print(df.memory_usage(deep=True))

### 2.3 Dataset Information

The `.info()` method provides a concise summary of the DataFrame including data types, non-null counts, and memory usage.

In [None]:
# Detailed information about the dataset
print("=" * 70)
print("DETAILED DATASET INFORMATION")
print("=" * 70)
df.info()

print("\n" + "=" * 70)
print("KEY OBSERVATIONS:")
print("=" * 70)
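# Derive the observations from the data rather than hard-coding them
all_numeric = df.select_dtypes(exclude='number').shape[1] == 0
total_missing = int(df.isnull().sum().sum())
print(f"{'✓' if all_numeric else '✗'} All features numeric: {all_numeric}")
print(f"{'✓' if total_missing == 0 else '✗'} Missing values across all columns: {total_missing}")
if all_numeric and total_missing == 0:
    print("✓ Dataset is ready for statistical analysis")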
print("=" * 70)

## 3. Statistical Summary

Statistical summaries help us understand the central tendency, spread, and distribution of our data.
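
To make the skewness and kurtosis figures reported in the next cell concrete, here is a small sketch that computes both from central moments. The function names are illustrative, and these are the plain population formulas; pandas' `.skew()` and `.kurtosis()` apply a sample-size bias correction, so their values differ slightly, most visibly on small samples.

```python
import numpy as np


def moment_skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    x = np.asarray(values, dtype=float)
    d = x - x.mean()
    return float(np.mean(d ** 3) / np.mean(d ** 2) ** 1.5)


def excess_kurtosis(values):
    """Population excess kurtosis: fourth central moment / variance ** 2, minus 3."""
    x = np.asarray(values, dtype=float)
    d = x - x.mean()
    return float(np.mean(d ** 4) / np.mean(d ** 2) ** 2 - 3.0)


symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 1, 10]

print("Skewness (symmetric):   ", moment_skewness(symmetric))
print("Skewness (right-tailed):", moment_skewness(right_tailed))
print("Excess kurtosis (symmetric):", excess_kurtosis(symmetric))
```

The symmetric sample has zero skewness and negative excess kurtosis (lighter tails than a normal distribution), while the single large value in the second sample produces positive skewness.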

In [None]:
# Statistical summary
print("=" * 70)
print("STATISTICAL SUMMARY OF ALL FEATURES")
print("=" * 70)
summary_stats = df.describe().T
summary_stats['range'] = summary_stats['max'] - summary_stats['min']
summary_stats['iqr'] = summary_stats['75%'] - summary_stats['25%']
display(summary_stats)

print("\n" + "=" * 70)
print("ADDITIONAL STATISTICS")
print("=" * 70)

# Skewness and Kurtosis for each feature
additional_stats = pd.DataFrame({
    'Skewness': df.skew(),
    'Kurtosis': df.kurtosis()
})
display(additional_stats)

print("\nInterpretation:")
print("-" * 70)
print("• Skewness: Measures asymmetry of distribution")
print("  - Near 0: Symmetric distribution")
print("  - Positive: Right-skewed (tail on right)")
print("  - Negative: Left-skewed (tail on left)")
print("\n• Kurtosis: Measures tailedness of distribution (pandas reports excess kurtosis)")
print("  - Near 0: Tails similar to a normal distribution")
print("  - Positive: Heavy tails (more outliers)")
print("  - Negative: Light tails (fewer outliers)")
print("=" * 70)

## 4. Missing Values and Data Quality Check

Checking for missing values is crucial as they can significantly impact our analysis and model performance.
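
The sample used in this notebook has no gaps, so the check in the next cell has nothing to fix. As a rough sketch of what detection plus one simple handling strategy (median imputation) can look like when gaps do exist, the toy frame below has two values removed on purpose. The frame and its values are made up for illustration, and median filling is only one option among several, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value per column (illustrative values)
toy = pd.DataFrame({
    "pH": [3.2, np.nan, 3.4, 3.3],
    "alcohol": [9.5, 10.1, np.nan, 11.0],
})

# Detection: count missing entries per column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Handling: replace each gap with that column's median (NaNs are skipped
# when the median is computed)
filled = toy.fillna(toy.median())
print(filled)
```

Dropping incomplete rows with `toy.dropna()` is the other common choice; which one is appropriate depends on how much data is missing and why.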

In [None]:
# Check for missing values
print("=" * 70)
print("MISSING VALUES ANALYSIS")
print("=" * 70)

missing_data = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percentage': (df.isnull().sum() / len(df)) * 100
})
missing_data = missing_data[missing_data['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

if len(missing_data) == 0:
    print("✓ No missing values found in the dataset!")
else:
    display(missing_data)

# Check for duplicates
print("\n" + "=" * 70)
print("DUPLICATE ROWS CHECK")
print("=" * 70)
n_duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {n_duplicates}")
if n_duplicates > 0:
    print(f"Percentage of duplicates: {(n_duplicates/len(df))*100:.2f}%")
else:
    print("✓ No duplicate rows found!")

# Data quality summary
print("\n" + "=" * 70)
print("DATA QUALITY SUMMARY")
print("=" * 70)
print(f"✓ Dataset Completeness: {((df.size - df.isnull().sum().sum()) / df.size * 100):.2f}%")
print(f"✓ Total data points: {df.size:,}")
print(f"✓ Missing data points: {df.isnull().sum().sum()}")
print("=" * 70)

## 5. Target Variable Analysis (Wine Quality)

Understanding our target variable is crucial before analyzing features.

In [None]:
# Analyze quality distribution
print("=" * 70)
print("WINE QUALITY DISTRIBUTION ANALYSIS")
print("=" * 70)

quality_counts = df['quality'].value_counts().sort_index()
quality_percentages = (quality_counts / len(df) * 100).round(2)

quality_summary = pd.DataFrame({
    'Count': quality_counts,
    'Percentage': quality_percentages
})

print("\nQuality Score Distribution:")
display(quality_summary)

print("\nStatistics:")
print(f"Mean Quality: {df['quality'].mean():.2f}")
print(f"Median Quality: {df['quality'].median():.0f}")
print(f"Mode Quality: {df['quality'].mode()[0]}")
print(f"Quality Range: {df['quality'].min()} - {df['quality'].max()}")

# Visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Count plot
sns.countplot(data=df, x='quality', palette='viridis', ax=axes[0])
axes[0].set_title('Distribution of Wine Quality Scores', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Quality Score', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].grid(axis='y', alpha=0.3)

# Add count labels
for container in axes[0].containers:
    axes[0].bar_label(container)

# Pie chart
axes[1].pie(quality_counts, labels=quality_counts.index, autopct='%1.1f%%',
            startangle=90, colors=sns.color_palette('viridis', len(quality_counts)))
axes[1].set_title('Quality Score Distribution (%)', fontsize=14, fontweight='bold')

# Box plot
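# Use `color` rather than `palette`: recent seaborn versions deprecate `palette` without `hue`
sns.boxplot(y=df['quality'], color=sns.color_palette('Set2')[0], ax=axes[2])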
axes[2].set_title('Box Plot of Quality Scores', fontsize=14, fontweight='bold')
axes[2].set_ylabel('Quality Score', fontsize=12)
axes[2].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 70)
print("KEY INSIGHTS:")
print("=" * 70)
print(f"• Most wines are rated 5 or 6 (medium quality)")
print(f"• Very few wines receive extreme ratings (3 or 8)")
print("• The distribution is roughly bell-shaped (quality is a discrete score), centered around 5-6")
print("=" * 70)

## 6. Correlation Analysis

Correlation analysis helps us understand relationships between variables. This is crucial for feature selection and understanding which factors influence wine quality.
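
`df.corr()` computes the Pearson coefficient by default. The sketch below reproduces it from the definition on a small made-up pair of columns and also computes the rank-based Spearman coefficient, which can be a reasonable alternative for an ordinal target such as quality. The column values and the helper name `pearson_r` are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Made-up values for illustration only
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.5],
    "quality": [5, 5, 6, 6, 7],
})

r_manual = pearson_r(toy["alcohol"], toy["quality"])
r_pandas = toy["alcohol"].corr(toy["quality"])  # Pearson by default

# Spearman is Pearson applied to ranks (ties receive their average rank);
# df.corr(method="spearman") gives the same quantity for a whole DataFrame.
r_spearman = pearson_r(toy["alcohol"].rank(), toy["quality"].rank())

print(f"Pearson (manual):  {r_manual:.4f}")
print(f"Pearson (pandas):  {r_pandas:.4f}")
print(f"Spearman (ranks):  {r_spearman:.4f}")
```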

In [None]:
# Calculate correlation matrix
correlation_matrix = df.corr()

# Correlation with quality
quality_corr = correlation_matrix['quality'].sort_values(ascending=False)
print("=" * 70)
print("CORRELATION WITH WINE QUALITY")
print("=" * 70)
print(quality_corr)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(18, 7))

# Heatmap of all correlations
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, ax=axes[0], cbar_kws={'label': 'Correlation'})
axes[0].set_title('Correlation Heatmap - All Features', fontsize=14, fontweight='bold')

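# Bar plot of correlations with quality
# Plot the Series directly: for a one-column DataFrame, pandas applies only the
# first color in the list to every bar instead of coloring bars individually.
quality_corr_features = quality_corr.drop('quality')
bar_colors = ['green' if x > 0 else 'red' for x in quality_corr_features]
quality_corr_features.plot(kind='barh', ax=axes[1], color=bar_colors, alpha=0.7)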
axes[1].set_title('Feature Correlation with Wine Quality', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Correlation Coefficient', fontsize=12)
axes[1].set_ylabel('Features', fontsize=12)
axes[1].axvline(x=0, color='black', linestyle='--', linewidth=1)
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n" + "=" * 70)
print("KEY INSIGHTS:")
print("=" * 70)
print("Positive correlations with quality (r > 0.1; higher values accompany higher ratings):")
for feature, corr in quality_corr[quality_corr > 0.1].items():
    if feature != 'quality':
        print(f"  • {feature}: {corr:.3f}")

print("\nNegative correlations with quality (r < -0.1; higher values accompany lower ratings):")
for feature, corr in quality_corr[quality_corr < -0.1].items():
    print(f"  • {feature}: {corr:.3f}")

print("\n" + "=" * 70)
print("INTERPRETATION:")
print("=" * 70)
print("• Correlation ranges from -1 to +1")
print("• +1: Perfect positive correlation")
print("• -1: Perfect negative correlation")
print("•  0: No linear correlation")
print("• |corr| > 0.7: Strong correlation")
print("• |corr| 0.4-0.7: Moderate correlation")
print("• |corr| < 0.4: Weak correlation")
print("=" * 70)

## 7. Distribution Analysis and Visualization

Understanding the distribution of each feature helps us identify patterns, outliers, and prepare for modeling.

### 7.1 Distribution Plots for All Features

In [None]:
# Plot distributions for all numeric features
numeric_features = df.select_dtypes(include=[np.number]).columns.drop('quality')

fig, axes = plt.subplots(4, 3, figsize=(18, 16))
axes = axes.ravel()

for idx, col in enumerate(numeric_features):
    # Histogram with KDE
    sns.histplot(df[col], kde=True, ax=axes[idx], color='skyblue', edgecolor='black')
    axes[idx].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col, fontsize=10)
    axes[idx].set_ylabel('Frequency', fontsize=10)
    axes[idx].grid(axis='y', alpha=0.3)
    
    # Add mean and median lines
    mean_val = df[col].mean()
    median_val = df[col].median()
    axes[idx].axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.2f}')
    axes[idx].axvline(median_val, color='green', linestyle='--', linewidth=2, label=f'Median: {median_val:.2f}')
    axes[idx].legend(fontsize=8)

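# Hide any unused subplot (the 4 x 3 grid has 12 slots for 11 features)
for ax in axes[len(numeric_features):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()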

print("=" * 70)
print("DISTRIBUTION INSIGHTS:")
print("=" * 70)
for col in numeric_features:
    skew_val = df[col].skew()
    if abs(skew_val) < 0.5:
        dist_type = "approximately symmetric"
    elif skew_val > 0:
        dist_type = "right-skewed (positive skew)"
    else:
        dist_type = "left-skewed (negative skew)"
    print(f"• {col:25s}: {dist_type} (skewness: {skew_val:6.2f})")
print("=" * 70)

## 8. Outlier Detection

Outliers can significantly impact statistical analyses and machine learning models. Let's identify them visually with box plots and then count them with the interquartile range (IQR) rule.
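
As a small worked example of the IQR rule, the sketch below flags any point lying more than 1.5 × IQR beyond the first or third quartile. The helper name `iqr_outlier_mask` and the toy values are illustrative assumptions.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy values: Q1 = 2, Q3 = 4, IQR = 2, so the fences are -1 and 7
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="residual sugar")
mask = iqr_outlier_mask(toy)
print(toy[mask].tolist())  # → [100.0]
```

Returning a mask rather than the filtered rows keeps the choice open between inspecting, capping, or dropping the flagged points.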

In [None]:
# Box plots for outlier detection
fig, axes = plt.subplots(4, 3, figsize=(18, 16))
axes = axes.ravel()

for idx, col in enumerate(numeric_features):
    sns.boxplot(y=df[col], ax=axes[idx], color='lightcoral')
    axes[idx].set_title(f'Box Plot: {col}', fontsize=12, fontweight='bold')
    axes[idx].set_ylabel(col, fontsize=10)
    axes[idx].grid(axis='y', alpha=0.3)

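# Hide any unused subplot (the 4 x 3 grid has 12 slots for 11 features)
for ax in axes[len(numeric_features):]:
    ax.set_visible(False)

plt.tight_layout()
plt.show()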

# Calculate outliers using IQR method
print("=" * 70)
print("OUTLIER ANALYSIS (Using IQR Method)")
print("=" * 70)
print("\nNumber and percentage of outliers per feature:")
print("-" * 70)

for col in numeric_features:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    n_outliers = len(outliers)
    pct_outliers = (n_outliers / len(df)) * 100
    
    print(f"{col:25s}: {n_outliers:4d} outliers ({pct_outliers:5.2f}%)")

print("=" * 70)
print("\nNote: Outliers aren't necessarily errors - they may represent")
print("legitimate extreme values that should be investigated.")
print("=" * 70)

## 9. Summary and Key Findings

### EDA Summary

In this Exploratory Data Analysis, run on a synthetic stand-in for the Red Wine Quality dataset, we covered the following:

**Dataset Characteristics:**
- 1,599 wine samples with 11 physicochemical features + 1 quality rating
- All numeric features, no missing values
- Quality scores range from 3 to 8, with most wines rated 5 or 6

**Key Findings:**

1. **Quality Distribution**
   - Most wines receive medium ratings (5-6)
   - Very few excellent (8) or poor (3) quality wines
   - This suggests quality prediction will be challenging due to class imbalance

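2. **Feature Correlations with Quality**
   - Positive correlations mark features whose higher values accompany higher ratings
   - Negative correlations mark features whose higher values accompany lower ratings
   - Because the synthetic sample draws each feature independently of quality, its correlations sit near zero; rerun the analysis on the real data to see meaningful values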

3. **Distribution Patterns**
   - Several features show right-skewed distributions
   - Some features are approximately normal
   - Skewed features may benefit from transformation

4. **Outliers**
   - Multiple features contain outliers
   - Outliers may represent premium or faulty wines
   - They need careful handling in the modeling phase
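
Finding 3 notes that skewed features may benefit from transformation. A minimal sketch of one common option, `np.log1p`, is shown below on synthetic right-skewed values. The data here is generated only for illustration, and whether a log transform is appropriate depends on the feature and the model.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values (illustrative; not taken from the wine data)
rng = np.random.default_rng(0)
raw = pd.Series(rng.exponential(scale=2.5, size=1000), name="skewed feature")

# log1p computes log(1 + x), which is defined at zero and compresses the right tail
transformed = np.log1p(raw)

skew_before = raw.skew()
skew_after = transformed.skew()
print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after log1p: {skew_after:.2f}")
```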

**Next Steps:**

1. **Feature Engineering**: Create new features from existing ones
2. **Feature Selection**: Choose most important features
3. **Data Preprocessing**: Handle outliers and scale features
4. **Model Building**: Train machine learning models for quality prediction
5. **Model Evaluation**: Assess performance and iterate

### What We Learned About EDA:

- **Data understanding is crucial** before modeling
- **Visualization reveals patterns** that statistics alone might miss
- **Correlation doesn't imply causation** - always investigate relationships
- **Outliers require investigation** - they're not always errors
- **Domain knowledge matters** - understanding wine chemistry helps interpretation

This EDA provides a solid foundation for building predictive models and generating actionable insights about wine quality!