# Data Analysis Example

This notebook demonstrates basic data analysis techniques using a simple employee dataset. We'll explore the data, generate statistics, and create visualizations.

## 1. Import Required Libraries

Import pandas, matplotlib, and other necessary libraries for data analysis.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Set up matplotlib for inline plotting
%matplotlib inline

# Use matplotlib's default style with seaborn's "husl" color palette
plt.style.use('default')
sns.set_palette("husl")

## 2. Load the CSV Dataset

Use pandas to read the CSV file and load it into a DataFrame.

In [None]:
# Load the CSV file
df = pd.read_csv('sample_data.csv')

print("Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
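
The load step above assumes that `sample_data.csv` exists and contains the columns used later in this notebook. As a minimal sketch of a more defensive load, the example below checks for those columns before continuing. It reads from in-memory CSV text so it runs without the file; the rows and the `load_employees` helper are hypothetical and serve only as illustration.

```python
import io

import pandas as pd

# Columns that the later cells in this notebook rely on
EXPECTED_COLUMNS = ["Name", "Age", "Department", "Salary", "City"]

# Made-up rows standing in for sample_data.csv (illustration only)
SAMPLE_CSV = """Name,Age,Department,Salary,City
Alice,30,Engineering,85000,Boston
Bob,26,Marketing,65000,Denver
Cara,42,Engineering,80000,Boston
"""


def load_employees(source):
    """Read the CSV and fail early if a required column is missing."""
    frame = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in frame.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return frame


demo_df = load_employees(io.StringIO(SAMPLE_CSV))
print(f"Dataset shape: {demo_df.shape}")
```

Passing a file path such as `'sample_data.csv'` instead of the `StringIO` object should work the same way, since `pd.read_csv` accepts both.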

## 3. Explore Dataset Structure

Display the first few rows, column names, data types, and shape of the dataset.

In [None]:
# Display first 5 rows
print("First 5 rows of the dataset:")
print(df.head())

print("\n" + "="*50 + "\n")

# Display basic information about the dataset
# df.info() prints its report directly and returns None, so it is not wrapped in print()
print("Dataset information:")
df.info()

print("\n" + "="*50 + "\n")

# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
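
The data types and missing-value counts shown above can also be combined into a single overview table, which is easier to scan on wider datasets. The sketch below uses a small made-up DataFrame with one deliberately missing `Age` value; the `overview` name and the sample rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing Age value (illustration only)
demo_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara"],
    "Age": [30, np.nan, 42],
    "Salary": [85000, 65000, 80000],
})

# One row per column: data type, count of missing values, percentage missing
overview = pd.DataFrame({
    "dtype": demo_df.dtypes.astype(str),
    "missing": demo_df.isnull().sum(),
    "missing_pct": (demo_df.isnull().mean() * 100).round(1),
})
print(overview)
```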

## 4. Display Basic Statistics

Generate summary statistics with `describe()` and related methods to understand how the data is distributed.

In [None]:
# Summary statistics for numerical columns
print("Summary statistics for numerical columns:")
print(df.describe())

print("\n" + "="*50 + "\n")

# Value counts for categorical columns
print("Department distribution:")
print(df['Department'].value_counts())

print("\n" + "="*30 + "\n")

print("City distribution:")
print(df['City'].value_counts())

print("\n" + "="*30 + "\n")

# Calculate some additional statistics
print("Additional statistics:")
print(f"Average salary: ${df['Salary'].mean():,.2f}")
print(f"Median salary: ${df['Salary'].median():,.2f}")
print(f"Average age: {df['Age'].mean():.1f} years")
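
The overall averages above can hide differences between groups. As a rough sketch, named aggregation with `groupby(...).agg(...)` produces several statistics per department in one table. The rows below are made up, and the output column names (`headcount`, `mean_salary`, `median_age`) are chosen here for illustration.

```python
import pandas as pd

# Made-up rows (illustration only)
demo_df = pd.DataFrame({
    "Department": ["Engineering", "Marketing", "Engineering", "Marketing"],
    "Age": [30, 26, 42, 28],
    "Salary": [85000, 65000, 80000, 70000],
})

# Named aggregation: each keyword becomes an output column,
# defined as a (source column, aggregation function) pair
dept_summary = demo_df.groupby("Department").agg(
    headcount=("Salary", "size"),
    mean_salary=("Salary", "mean"),
    median_age=("Age", "median"),
)
print(dept_summary)
```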

## 5. Data Visualization

Create basic plots such as histograms, bar charts, scatter plots, and box plots to visualize the data.

In [None]:
# Create a figure with multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Histogram of Age distribution
axes[0, 0].hist(df['Age'], bins=8, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Age Distribution')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')

# 2. Histogram of Salary distribution
axes[0, 1].hist(df['Salary'], bins=8, edgecolor='black', alpha=0.7, color='green')
axes[0, 1].set_title('Salary Distribution')
axes[0, 1].set_xlabel('Salary ($)')
axes[0, 1].set_ylabel('Frequency')

# 3. Bar chart of Department counts
dept_counts = df['Department'].value_counts()
axes[1, 0].bar(dept_counts.index, dept_counts.values, color='orange')
axes[1, 0].set_title('Number of Employees by Department')
axes[1, 0].set_xlabel('Department')
axes[1, 0].set_ylabel('Count')
axes[1, 0].tick_params(axis='x', rotation=45)

# 4. Scatter plot of Age vs Salary
axes[1, 1].scatter(df['Age'], df['Salary'], alpha=0.7, color='red')
axes[1, 1].set_title('Age vs Salary')
axes[1, 1].set_xlabel('Age')
axes[1, 1].set_ylabel('Salary ($)')

plt.tight_layout()
plt.show()

In [None]:
# Box plot showing salary distribution by department
plt.figure(figsize=(10, 6))
df.boxplot(column='Salary', by='Department', ax=plt.gca())
plt.suptitle('')  # remove the automatic "Boxplot grouped by Department" heading
plt.title('Salary Distribution by Department')
plt.xlabel('Department')
plt.ylabel('Salary ($)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
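
The Age vs Salary scatter plot in this section shows the relationship visually. A correlation coefficient puts a number on it. The sketch below computes the Pearson correlation with `Series.corr` on a few made-up rows; the values are assumptions for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows in which salary rises with age (illustration only)
demo_df = pd.DataFrame({
    "Age": [26, 30, 35, 42],
    "Salary": [65000, 70000, 75000, 85000],
})

# Series.corr computes the Pearson correlation coefficient by default
age_salary_corr = demo_df["Age"].corr(demo_df["Salary"])
print(f"Pearson correlation (Age vs Salary): {age_salary_corr:.2f}")
```

A value near 1 indicates a strong positive linear relationship, a value near -1 a strong negative one, and a value near 0 little linear relationship.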

## 6. Filter and Query Data

Demonstrate filtering rows and selecting specific columns based on conditions.

In [None]:
# Filter employees with salary > $75,000
high_earners = df[df['Salary'] > 75000]
print("Employees with salary > $75,000:")
print(high_earners)

print("\n" + "="*50 + "\n")

# Filter engineering department employees
engineers = df[df['Department'] == 'Engineering']
print("Engineering department employees:")
print(engineers[['Name', 'Age', 'Salary']])

print("\n" + "="*50 + "\n")

# Multiple conditions: Young employees (age < 30) in Marketing
young_marketers = df[(df['Age'] < 30) & (df['Department'] == 'Marketing')]
print("Young employees (age < 30) in Marketing:")
print(young_marketers)

print("\n" + "="*50 + "\n")

# Group by department and calculate average salary
dept_avg_salary = df.groupby('Department')['Salary'].mean().sort_values(ascending=False)
print("Average salary by department:")
for dept, avg_sal in dept_avg_salary.items():
    print(f"{dept}: ${avg_sal:,.2f}")

print("\n" + "="*50 + "\n")

# Select specific columns and sort by salary, highest first
selected_data = df[['Name', 'Department', 'Salary']].sort_values('Salary', ascending=False)
print("Top 5 highest paid employees:")
print(selected_data.head())
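
The boolean-mask filters above can also be written with `DataFrame.query`, and the sort-then-`head` pattern with `DataFrame.nlargest`. The sketch below shows both on a small made-up DataFrame; the rows are assumptions for illustration, and either style should give the same result as the mask-based versions.

```python
import pandas as pd

# Made-up rows (illustration only)
demo_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dan"],
    "Age": [30, 26, 42, 28],
    "Department": ["Engineering", "Marketing", "Engineering", "Marketing"],
    "Salary": [85000, 65000, 80000, 70000],
})

# query() takes the condition as a string, which reads well with multiple clauses
young_marketers = demo_df.query("Age < 30 and Department == 'Marketing'")
print(young_marketers)

# nlargest() returns the top n rows by a column without a separate sort
top_two = demo_df.nlargest(2, "Salary")[["Name", "Department", "Salary"]]
print(top_two)
```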

## Conclusion

This notebook demonstrated basic data analysis techniques including:
- Loading CSV data with pandas
- Exploring dataset structure and properties
- Calculating summary statistics
- Creating various types of visualizations
- Filtering and querying data with conditions

The sample employee dataset shows patterns such as:
- The Engineering department has the highest average salary
- Most employees are between 26 and 42 years old
- Salaries range from $65,000 to $85,000

This foundation can be extended for more complex analyses including correlation analysis, predictive modeling, and advanced visualizations.