# Week 6 - Exploratory Data Analysis (EDA) with Python

## Introduction to EDA
Exploratory Data Analysis (EDA) is a critical step in data analysis: it summarizes a dataset's main characteristics, often with statistical graphics and other visualization methods. It lets analysts uncover patterns, spot anomalies, test hypotheses, and check assumptions using summary statistics and graphical representations.

## Objectives:
- Understand the principles of Exploratory Data Analysis.
- Learn to conduct a basic EDA using Python.
- Become familiar with Python libraries such as pandas, NumPy, and Matplotlib for data analysis.

## Topics Covered:
- Data Ingestion
- Data Cleaning
- Univariate Analysis
- Bivariate and Multivariate Analysis
- Data Transformation and Feature Engineering
- Outlier Detection
- Use of Statistical Methods
- Data Visualization

## Activities:

### Data Ingestion and Cleaning:
```python
import pandas as pd

# Load the dataset into a DataFrame
df = pd.read_csv('dataset.csv')
```
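
Real files often need a little help at load time. The following is a minimal sketch, assuming a small in-memory CSV in place of `dataset.csv`; the column names `order_date`, `amount`, and `region` and the `?` missing-value marker are hypothetical.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'dataset.csv' so the example runs anywhere
csv_text = """order_date,amount,region
2024-01-05,120.5,North
2024-01-06,?,South
2024-01-07,87.0,NA
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=['order_date'],  # parse this column as datetimes
    na_values=['?'],             # treat '?' as missing, in addition to the defaults
)

print(sample.dtypes)
print(sample.isna().sum())
```

Declaring missing-value markers up front keeps a numeric column such as `amount` from being read as text.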

## Data Profiling

```python
# Inspect the first few rows
print(df.head())

# View data types and non-null counts for each column
# (df.info() prints its report directly and returns None)
df.info()

# Descriptive statistics for numerical columns
print(df.describe())
```
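
A compact per-column profile can complement `head()`, `info()`, and `describe()`. This is a sketch on hypothetical toy data (the `age` and `city` columns are made up for illustration), not a fixed recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the loaded dataset
toy = pd.DataFrame({
    'age': [25, 32, np.nan, 41],
    'city': ['Lima', None, 'Oslo', 'Lima'],
})

# One row per column: type, missing count, missing share, distinct values
profile = pd.DataFrame({
    'dtype': toy.dtypes.astype(str),
    'missing': toy.isna().sum(),
    'missing_pct': (toy.isna().mean() * 100).round(1),
    'unique': toy.nunique(),  # counts distinct non-missing values
})
print(profile)
```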

## Variable Identification

```python
# Categorize variables by type
categorical = df.select_dtypes(include=['object']).columns.tolist()
numerical = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
```
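
Listing `int64` and `float64` explicitly can miss numeric columns stored with other widths. A small sketch, using hypothetical toy columns, of how the generic `'number'` selector behaves by comparison:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with 32-bit numeric columns
toy = pd.DataFrame({
    'count_small': np.array([1, 2, 3], dtype='int32'),
    'price': np.array([1.5, 2.5, 3.5], dtype='float32'),
    'label': ['a', 'b', 'a'],
})

# An explicit list of 64-bit dtypes skips the 32-bit columns
narrow = toy.select_dtypes(include=['int64', 'float64']).columns.tolist()

# 'number' matches every numeric dtype
broad = toy.select_dtypes(include='number').columns.tolist()

# Everything that is not numeric
non_numeric = toy.select_dtypes(exclude='number').columns.tolist()

print(narrow, broad, non_numeric)
```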

## Data Cleaning

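```python
# Handling missing values: choose ONE strategy.
# Running dropna() first leaves nothing for fillna() to fill.
df = df.dropna()      # Option A: drop rows with any missing value
# df = df.fillna(0)   # Option B: fill missing values with zeros instead

# Correcting data types
# astype('int') raises an error if the column still contains NaN or non-numeric text
df['column_name'] = df['column_name'].astype('int')
```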
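
Dropping rows or filling with zeros is not always appropriate. As one possible alternative, the sketch below imputes per column and converts messy text to numbers; the columns `income`, `segment`, and `visits` are hypothetical, and median/mode imputation is an assumption about what suits the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps and a text column that should be numeric
toy = pd.DataFrame({
    'income': [52000.0, np.nan, 61000.0, 58000.0],
    'segment': ['a', 'b', None, 'b'],
    'visits': ['3', '7', 'unknown', '2'],
})

# Numerical column: fill gaps with the median
toy['income'] = toy['income'].fillna(toy['income'].median())

# Categorical column: fill gaps with the most frequent value
toy['segment'] = toy['segment'].fillna(toy['segment'].mode()[0])

# Text that should be numeric: unparseable entries become NaN instead of raising
toy['visits'] = pd.to_numeric(toy['visits'], errors='coerce')

# Nullable integer dtype stores integers while keeping the missing value
toy['visits'] = toy['visits'].astype('Int64')

print(toy)
```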

## Univariate Analysis

```python
import matplotlib.pyplot as plt

# Histogram for numerical data
df['numerical_column'].hist(bins=50)
plt.xlabel('numerical_column')
plt.ylabel('Frequency')
plt.show()

# Bar chart of category frequencies
df['categorical_column'].value_counts().plot(kind='bar')
plt.ylabel('Count')
plt.show()
```
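
Plots are easier to read alongside a few numbers. A minimal sketch on hypothetical toy values, reusing the placeholder column names from this section:

```python
import pandas as pd

# Hypothetical toy data using this section's placeholder column names
toy = pd.DataFrame({
    'numerical_column': [1.0, 2.0, 2.0, 3.0, 12.0],
    'categorical_column': ['x', 'y', 'x', 'x', 'z'],
})

# Center, spread, and quartiles of the numerical column
summary = toy['numerical_column'].describe()

# Positive skewness indicates a long right tail
skewness = toy['numerical_column'].skew()

# Share of rows in each category
shares = toy['categorical_column'].value_counts(normalize=True)

print(summary)
print(skewness)
print(shares)
```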

## Bivariate/Multivariate Analysis

```python
# Scatter plot for numerical variable relationships
plt.scatter(df['numerical_column_1'], df['numerical_column_2'])
plt.xlabel('Column 1')
plt.ylabel('Column 2')
plt.show()

# Correlation matrix (Pearson by default) for the numerical columns
correlation_matrix = df[numerical].corr()
print(correlation_matrix)
```
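
Scatter plots and correlations cover pairs of numerical variables. When a categorical variable is involved, grouped summaries and contingency tables are a common choice. The sketch below uses hypothetical columns `group`, `outcome`, and `score`.

```python
import pandas as pd

# Hypothetical toy data mixing categorical and numerical variables
toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b', 'b'],
    'outcome': ['yes', 'no', 'yes', 'yes', 'no'],
    'score': [10.0, 14.0, 20.0, 22.0, 27.0],
})

# Numerical vs. categorical: summary statistics per group
per_group = toy.groupby('group')['score'].agg(['mean', 'median', 'count'])

# Categorical vs. categorical: contingency table of counts
table = pd.crosstab(toy['group'], toy['outcome'])

print(per_group)
print(table)
```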

## Handling Outliers

```python
# Box plot to visualize outliers
df.boxplot(column=['numerical_column'])
plt.show()
```
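
A box plot conventionally draws points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The same rule can be applied in code to flag those rows. This is a sketch on hypothetical values; the 1.5 multiplier is a convention, not a requirement.

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 95], name='numerical_column')

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask of points outside the fences
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

print(lower, upper)
print(outliers)
```

Whether flagged points should be removed, capped, or kept depends on whether they are errors or genuine extreme observations.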

## Feature Engineering

```python
# Create a new feature as the ratio of two columns
# (produces inf or NaN where the denominator is 0)
df['new_feature'] = df['numerical_column_1'] / df['numerical_column_2']
```
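
Ratio features break down when the denominator contains zeros. One way to guard against this, sketched on hypothetical values, is to turn zero denominators into missing values first:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a zero in the denominator column
toy = pd.DataFrame({
    'numerical_column_1': [10.0, 5.0, 8.0],
    'numerical_column_2': [2.0, 0.0, 4.0],
})

# Replace zero denominators with NaN so the ratio is NaN rather than inf
denominator = toy['numerical_column_2'].replace(0, np.nan)
toy['new_feature'] = toy['numerical_column_1'] / denominator

print(toy)
```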

## Data Transformation

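```python
import numpy as np

# Log transformation: log1p(x) computes log(1 + x), so zeros map to 0
# (values must be greater than -1)
df['log_transformed'] = np.log1p(df['numerical_column'])
```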
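
A log transformation is typically used to reduce right skew. The sketch below, on hypothetical right-skewed values, compares skewness before and after the transformation and shows that `np.expm1` reverses it.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values, including a zero
values = pd.Series([0, 2, 2, 3, 4, 6, 9, 15, 40, 120], dtype='float64')

# log1p(x) = log(1 + x), defined at zero
logged = np.log1p(values)

# Compare skewness before and after the transformation
skew_before = values.skew()
skew_after = logged.skew()

# expm1 inverts log1p, recovering the original scale
restored = np.expm1(logged)

print(skew_before, skew_after)
```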

## Correlation Analysis

```python
import seaborn as sns

# Heatmap of the correlation matrix computed earlier
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
```

## Documentation & Iterative Analysis

Document every step and finding; this is crucial for reproducibility and for communicating results to others. EDA is iterative: refine the techniques as insights emerge.

## Conclusion

These techniques and visualizations form the backbone of EDA in Python. They enable the analyst to understand the data's structure, relationships, and patterns before proceeding to more complex analyses or building predictive models.