# Python for Data Analysis & Visualization

## Learning Objectives

By the end of this chapter, you will be able to:

- ✅ Use all components of the **Python Data Science Stack**
- ✅ Manipulate data using **Pandas DataFrames**
- ✅ Create simple plots using **Pandas** and **Matplotlib**

---

## Tools and Libraries Covered

- **NumPy** – Numerical computing
- **Pandas** – Data manipulation
- **Matplotlib** – Data visualization
- **Seaborn** – Statistical plotting
- **IPython** – Interactive computing
- **Jupyter Notebook** – Interactive development environment

---

## Visualizations

- Basic plotting with **Matplotlib**
- Enhanced plotting with **Seaborn**

---



# 🧠 Introduction to the Python Data Science Stack

## What is the Python Data Science Stack?

The **Python data science stack** is an informal term for a collection of libraries commonly used to solve data analysis and data science problems. While there is no fixed list, several core libraries are widely adopted for their powerful features and interoperability.

In this chapter, we focus on using the stack to manipulate **tabular data**—a foundational skill before transitioning to **large-scale data** or big data tools.

---

## Why Python for Data Science?

- Python offers **a vast ecosystem of libraries and packages**
- More than **130,000 packages** are available on **PyPI**
- These packages simplify numerical computation, data manipulation, visualization, and interactive computing

---

## Core Components of the Data Science Stack

| Library         | Purpose                                                   |
|-----------------|-----------------------------------------------------------|
| **NumPy**       | Numerical computing and array manipulation                |
| **pandas**      | Data analysis and manipulation with DataFrames            |
| **SciPy**       | Advanced math algorithms built on top of NumPy            |
| **Matplotlib**  | Data visualization and basic plotting                     |
| **IPython**     | Enhanced interactive Python shell                         |
| **Jupyter**     | Web-based notebook for interactive computing              |

---



## IPython: A Powerful Interactive Shell

[IPython](https://ipython.org/) is an enhanced interactive Python shell that helps you test and explore ideas quickly without creating and running full script files.

### Why Use IPython?

Compared to the standard Python shell, IPython offers:

-  **Persistent Input History**  
  Reuse commands even after restarting the shell.

-  **Tab Completion**  
  Auto-complete functions, variables, and commands.

-  **Magic Commands**  
  Special commands (starting with `%` or `%%`) to boost productivity, like reloading changed modules without restarting the shell.

-  **Syntax Highlighting**  
  Makes code more readable with colored syntax.

> Ideal for rapid experimentation and a better coding experience!


## Exercise 1: Interacting with the Python Shell Using the IPython Commands
![image.png](attachment:a8c04ecc-cecd-4a35-a962-0115c294b8b7.png)

![image.png](attachment:0c66e121-8935-4f53-a6a6-235a36e6a887.png)

![image.png](attachment:7962ffcd-e753-4899-8d2c-1b11a38556d8.png)

---


## Jupyter Notebook Overview

The **Jupyter Notebook** is a web-based interactive environment for writing and running code. It was originally part of IPython and was split out into the standalone Jupyter project with the IPython 4.0 release.

### Key Features

- Runs directly in your **web browser**
- Supports **code**, **text**, **graphs**, and **images**
- Allows **sharing** notebooks over the internet
- Supports **over 40 kernels**, including Python, R, and Julia
- Widely used in **data science**, **education**, and **research**

### What is a Kernel?

A **kernel** is the computation engine that executes the code in your notebook. For example, the **IPython kernel** runs Python code.

### Cells in Jupyter

A **notebook** is made up of **cells**. Each cell can be one of two types:

- **Code cell**: Executes code and displays the output
- **Markdown cell**: Contains formatted text using Markdown

### Working with Cells

When a cell is selected, the notebook is in one of two modes:

- **Edit mode**: You can type in the cell to write or modify its content
- **Command mode**: Keystrokes act on the notebook as a whole, so you can run, add, move, or delete cells

This combination of code and narrative makes Jupyter an excellent tool for both development and presentation.

## Exercise 2: Getting Started with the Jupyter Notebook
To start the notebook server, run `jupyter notebook` in a terminal (Command Prompt on Windows).

![image.png](attachment:0a6b9799-c419-4500-aff3-a938d3d21317.png)

---

In [5]:
x = 2
print(x*2)

4


In [8]:
def mean(a, b):
    return (a + b) / 2

In [9]:
mean(10,3)

6.5

## IPython or Jupyter?

Both **IPython** and **Jupyter** play important roles in a data analysis workflow.

### When to Use IPython

- Best for quick experimentation
- Useful for debugging scripts
- Ideal for running asynchronous or data-heavy tasks
- Supports graphs, though less suited for presentation

### When to Use Jupyter Notebook

- Great for combining **code, text, and visuals**
- Ideal for **presenting results** and creating **visual reports**
- More natural environment for displaying **graphs and plots**

Most examples in this course will use **Jupyter Notebooks**, but the instructions are generally applicable to **IPython** as well.

---

## Activity 1

In [10]:
import numpy as np

def square_plus(x, c):
    return np.power(x,2) + c

In [11]:
x = 10
c = 100

result = square_plus(x, c)
print(result)

200


# Python Data Science Libraries Guide

## Overview
This guide covers the essential Python libraries for data science, with a simplified introduction to each library and its key features.

## 1. NumPy - Numerical Computing

### What is NumPy?
NumPy is the foundation library for numerical computing in Python. It provides powerful tools for working with arrays and mathematical operations.

### Key Features:
- **Multidimensional arrays**: Work with 1D, 2D, 3D, and higher dimensional arrays
- **Mathematical operations**: Linear algebra, statistics, and matrix operations
- **Performance**: Fast operations through C/C++/Fortran integration
- **Foundation**: Used by pandas, scikit-learn, and many other libraries

### Installation and Import:

```python
# Install once from a terminal if needed: pip install numpy
import numpy as np
```

### Basic Usage:

```python
# Create arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2], [3, 4]])

# Array operations
arr * 2  # Multiply all elements by 2
np.mean(arr)  # Calculate mean
np.sum(matrix)  # Sum all elements
```
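
As a small sketch of the multidimensional-array and linear-algebra features listed above, the following inspects an array's shape, aggregates along each axis, and computes a matrix product. The array values are made up purely for illustration.

```python
import numpy as np

# Hypothetical 2x3 matrix used only for illustration
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(matrix.shape)  # (2, 3): two rows, three columns

# axis=0 aggregates down the rows, giving one value per column
col_sums = matrix.sum(axis=0)

# axis=1 aggregates across the columns, giving one value per row
row_means = matrix.mean(axis=1)

# Matrix product of the 2x3 matrix with its 3x2 transpose gives a 2x2 result
gram = matrix @ matrix.T
print(gram)
```

The `axis` argument is the part that most often trips people up: it names the dimension that is collapsed, not the one that is kept.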

---

## 2. SciPy - Scientific Computing Ecosystem

### What is SciPy?
SciPy is both an ecosystem of scientific libraries and a specific library containing advanced mathematical functions.

### Key Features:
- **Ecosystem**: Includes NumPy, pandas, scikit-learn, and more
- **Advanced functions**: Optimization, integration, interpolation
- **Scientific computing**: Tools for mathematics, science, and engineering

### Common Use Cases:
- Statistical analysis
- Signal processing
- Image processing
- Optimization problems
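
To make the "advanced functions" above a little more concrete, here is a minimal sketch that uses two SciPy submodules, `scipy.integrate` and `scipy.optimize`. The functions being integrated and minimized are arbitrary choices for illustration, not part of the course material.

```python
import numpy as np
from scipy import integrate, optimize

# Numerically integrate sin(x) from 0 to pi; the exact answer is 2.
# quad returns the estimate and an estimate of the absolute error.
area, abs_error = integrate.quad(np.sin, 0, np.pi)

# Minimize the parabola (x - 3)^2 + 1, whose minimum is at x = 3
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)

print(area)
print(result.x)
```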

---

## 3. Matplotlib - Data Visualization

### What is Matplotlib?
Matplotlib is the primary plotting library for creating 2D graphs and visualizations in Python.

### Key Features:
- **Flexible plotting**: Line plots, bar charts, scatter plots, histograms
- **Multiple formats**: Save as PNG, PDF, SVG, and more
- **MATLAB-like interface**: Familiar syntax for MATLAB users
- **Integration**: Works with NumPy arrays and pandas DataFrames

### Installation and Import:

```python
# Install once from a terminal if needed: pip install matplotlib
import matplotlib.pyplot as plt
```

### Basic Usage:

```python
# Simple line plot
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.title("My First Plot")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()
```
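
The "multiple formats" feature can be sketched by saving the same kind of plot to a file instead of showing it. This example assumes a script running without a display, so it selects the non-interactive `Agg` backend; inside a Jupyter notebook that line is normally unnecessary. The output file name is hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so render off-screen
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# The object-oriented interface: a Figure holding one Axes
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("My First Plot")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")

# Save into a temporary directory; the extension selects the format
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "first_plot.png")  # hypothetical file name
fig.savefig(png_path)
plt.close(fig)  # release the figure's memory
```

Changing the extension to `.pdf` or `.svg` should produce those formats instead, since `savefig` infers the format from the file name.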

---

## 4. Pandas - Data Manipulation and Analysis

### What is Pandas?
Pandas is the go-to library for data manipulation and analysis, designed for working with structured data like Excel files and SQL tables.

### Key Data Structures:
- **Series**: 1D labeled array (like a column in Excel)
- **DataFrame**: 2D labeled data structure (like a complete Excel sheet)
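
The relationship between the two structures can be sketched as follows: a DataFrame behaves like a collection of Series that share one index. The names and values here are invented for illustration.

```python
import pandas as pd

# A Series: one labeled column of values (hypothetical data)
ages = pd.Series([31, 24, 45], index=["ana", "ben", "cho"], name="age")

# A DataFrame: several columns sharing the same row labels
people = pd.DataFrame(
    {
        "age": [31, 24, 45],
        "city": ["Lima", "Oslo", "Pune"],
    },
    index=["ana", "ben", "cho"],
)

# Selecting a single column of a DataFrame gives back a Series
age_column = people["age"]

print(type(age_column).__name__)
print(people.shape)  # (3, 2): three rows, two columns
```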

### Key Features:
- **Data reading**: Read CSV, Excel, JSON, SQL databases
- **Data cleaning**: Handle missing values, duplicates
- **Data manipulation**: Filter, group, merge, and transform data
- **Time series**: Powerful datetime handling
- **SQL-like operations**: GroupBy, joins, indexing

### Installation and Import:

```python
# Install once from a terminal if needed: pip install pandas
import pandas as pd
```

### Basic Operations:

#### Reading Data:

```python
# Read CSV file
df = pd.read_csv('data.csv')

# Read from URL
df = pd.read_csv('https://example.com/data.csv')

# Read Excel file
df = pd.read_excel('data.xlsx')
```

#### Data Exploration:

```python
# View first few rows
df.head()

# Get basic info
df.info()

# Statistical summary
df.describe()

# Check data types
df.dtypes
```

#### Data Selection and Filtering:

```python
# Select column
df['column_name']

# Select multiple columns
df[['col1', 'col2']]

# Filter rows
df[df['age'] > 25]

# Filter with multiple conditions
df[(df['age'] > 25) & (df['city'] == 'New York')]
```
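
The selection snippets above assume a DataFrame with `age` and `city` columns already exists. A self-contained sketch with a small made-up table shows what the filters return:

```python
import pandas as pd

# Hypothetical data with the column names used in the snippets above
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cho", "Dev"],
        "age": [31, 24, 45, 28],
        "city": ["New York", "New York", "Boston", "New York"],
    }
)

# A comparison produces a boolean Series; indexing with it keeps the True rows
older = df[df["age"] > 25]

# Each condition needs its own parentheses because & binds tighter than > and ==
older_ny = df[(df["age"] > 25) & (df["city"] == "New York")]

print(older_ny["name"].tolist())
```

Note that `&` and `|` are used here rather than the Python keywords `and` and `or`, which do not work element-wise on a Series.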

#### Data Manipulation:

```python
# Add new column
df['new_column'] = df['col1'] + df['col2']

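# Drop column (returns a new DataFrame; df itself is unchanged)
df.drop('column_name', axis=1)

# Group by and aggregate (numeric_only=True skips non-numeric columns)
df.groupby('category').sum(numeric_only=True)

# Sort data (also returns a new DataFrame)
df.sort_values('column_name')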
```
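
One detail worth seeing in action is that `drop` and `sort_values` return new DataFrames rather than modifying the original. The sketch below uses a small invented table to show this, along with a grouped sum.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame(
    {
        "category": ["a", "b", "a", "b"],
        "col1": [1, 2, 3, 4],
        "col2": [10, 20, 30, 40],
    }
)

# Assigning to a new column name modifies df in place
df["total"] = df["col1"] + df["col2"]

# drop returns a new DataFrame; df still contains col2 afterwards
trimmed = df.drop("col2", axis=1)

# Sum the total column within each category
totals = df.groupby("category")["total"].sum()

# Sort from largest to smallest total
ranked = df.sort_values("total", ascending=False)

print(totals)
```

To keep the result of `drop` or `sort_values`, assign it back to a variable, for example `df = df.drop("col2", axis=1)`.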

#### Handling Missing Data:

```python
# Check for missing values
df.isnull().sum()

# Drop rows with missing values (returns a new DataFrame)
df.dropna()

# Fill missing values with 0 (returns a new DataFrame)
df.fillna(0)
```
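
A short sketch with invented readings shows how the three calls above behave on a table that contains `NaN` values:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with some gaps
df = pd.DataFrame(
    {
        "temp": [20.5, np.nan, 22.0],
        "rain": [np.nan, np.nan, 3.0],
    }
)

# Count of missing values in each column
missing_per_column = df.isnull().sum()

# Keep only rows where every column has a value
complete_rows = df.dropna()

# Replace every missing value with 0
filled = df.fillna(0)

print(missing_per_column)
print(len(complete_rows))
```

Whether to drop or fill depends on the analysis: filling with 0 is convenient, but it can distort averages if 0 is not a meaningful value for the column.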

---

## 5. Integration and Workflow

### Typical Data Science Workflow:
1. **Import libraries** (NumPy, pandas, matplotlib)
2. **Read data** using pandas
3. **Explore and clean** data with pandas
4. **Perform calculations** with NumPy
5. **Visualize results** with matplotlib
6. **Apply machine learning** with scikit-learn (built on NumPy)

### Example Complete Workflow:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 1. Read data
df = pd.read_csv('sales_data.csv')

# 2. Explore data
print(df.head())
df.info()  # info() prints its summary itself and returns None

# 3. Clean data
df = df.dropna()
df['date'] = pd.to_datetime(df['date'])

# 4. Analyze data (grouping by month number merges the same month across years)
monthly_sales = df.groupby(df['date'].dt.month)['sales'].sum()

# 5. Visualize results
plt.figure(figsize=(10, 6))
plt.plot(monthly_sales.index, monthly_sales.values)
plt.title('Monthly Sales Trend')
plt.xlabel('Month')
plt.ylabel('Sales')
plt.show()
```
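
The workflow above depends on a `sales_data.csv` file that is not provided here. As a runnable sketch of the same read, clean, and aggregate steps, the version below reads a few invented rows from an in-memory string instead; the column names `date` and `sales` follow the example above, and the plotting step is left out.

```python
import io

import pandas as pd

# Made-up stand-in for sales_data.csv; one row has a missing sales value
csv_text = """date,sales
2023-01-05,100
2023-01-20,150
2023-02-03,
2023-02-14,200
2023-03-01,50
"""

# 1. Read data (StringIO lets read_csv treat a string like a file)
df = pd.read_csv(io.StringIO(csv_text))

# 2. Clean data: drop the incomplete row, then parse the dates
df = df.dropna()
df["date"] = pd.to_datetime(df["date"])

# 3. Analyze data: total sales per calendar month
monthly_sales = df.groupby(df["date"].dt.month)["sales"].sum()

print(monthly_sales)
```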

---

## 6. Key Benefits and Use Cases

### Why Use These Libraries?
- **NumPy**: Fast numerical operations, foundation for other libraries
- **Pandas**: Easy data manipulation, handles real-world messy data
- **Matplotlib**: Create professional visualizations
- **SciPy**: Advanced scientific computing functions

### Common Applications:
- Data analysis and reporting
- Business intelligence
- Financial modeling
- Scientific research
- Machine learning preprocessing
- Data visualization and dashboards

---

## 7. Getting Started Tips

### Best Practices:
1. **Start with pandas** for data manipulation
2. **Use NumPy** for numerical computations
3. **Visualize early and often** with matplotlib
4. **Keep code organized** in Jupyter notebooks
5. **Document your analysis** with markdown cells

### Learning Path:
1. Master basic pandas operations (reading, filtering, grouping)
2. Learn NumPy for numerical operations
3. Create visualizations with matplotlib
4. Explore advanced features as needed
5. Practice with real datasets

---

## 8. Additional Resources

### Official Documentation:
- [NumPy Documentation](https://numpy.org)
- [Pandas Documentation](https://pandas.pydata.org)
- [Matplotlib Documentation](https://matplotlib.org)
- [SciPy Documentation](https://scipy.org)

### Next Steps:
After mastering these libraries, consider exploring:
- **Seaborn**: Statistical data visualization
- **Plotly**: Interactive visualizations
- **Scikit-learn**: Machine learning
- **Jupyter**: Interactive development environment

## Exercise 3: Reading Data with Pandas

In [12]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/TrainingByPackt/Big-Data-Analysis-with-Python/master/Lesson01/imports-85.csv")
df.head()

Unnamed: 0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.60,...,130,mpfi,3.47,2.68,9.00,111,5000,21,27,13495
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
1,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
2,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
3,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
4,2,?,audi,gas,std,two,sedan,fwd,front,99.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250
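
The column names in the output above (`3`, `?`, `alfa-romero`, ...) look like data values, which suggests the file has no header row and `read_csv` used the first record as the header. A possible fix, sketched here on two invented rows in the same style, is to pass `header=None` and supply names. The column names below are hypothetical placeholders, and treating `?` as missing is an assumption about what that symbol means in the data.

```python
import io

import pandas as pd

# Two made-up rows without a header line, mimicking the file's layout
csv_text = "3,?,alfa-romero,gas,13495\n1,164,audi,gas,16500\n"

# Default behavior: the first row is consumed as column names
default_df = pd.read_csv(io.StringIO(csv_text))

# header=None keeps the first row as data; names supplies the labels
fixed_df = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["symboling", "losses", "make", "fuel", "price"],  # hypothetical names
    na_values="?",  # assumption: "?" marks a missing value
)

print(len(default_df), len(fixed_df))
```

With the default call one record is silently lost to the header, which is easy to miss on a larger file.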


### Pandas can read more formats:
- JSON
- Excel
- HTML
- HDF5
- Parquet (with PyArrow)
- SQL databases
- Google BigQuery
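
As one illustration of a non-CSV format, the sketch below writes a small invented DataFrame to a JSON string and reads it back with `pd.read_json`. The column names and values are made up; the other readers (`read_excel`, `read_parquet`, and so on) follow a similar pattern but may need extra packages installed.

```python
import io

import pandas as pd

# Hypothetical data for the round trip
df = pd.DataFrame({"state": ["ID", "AK"], "i131": [0, 32]})

# orient="records" produces a JSON list with one object per row
json_text = df.to_json(orient="records")

# Wrap the string in StringIO so read_json treats it like a file
restored = pd.read_json(io.StringIO(json_text), orient="records")

print(json_text)
print(restored)
```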

---

## Exercise 4: Data Selection and the .loc Method

In [13]:
import numpy as np
import pandas as pd

In [17]:
url = "https://raw.githubusercontent.com/TrainingByPackt/Big-Data-Analysis-with-Python/master/Lesson01/RadNet_Laboratory_Analysis.csv"
df = pd.read_csv(url)
df.head()

Unnamed: 0,State,Location,Date Posted,Date Collected,Sample Type,Unit,Ba-140,Co-60,Cs-134,Cs-136,Cs-137,I-131,I-132,I-133,Te-129,Te-129m,Te-132,Ba-140.1
0,ID,Boise,03/30/2011,03/23/2011,Air Filter,pCi/m3,0.0,0.0,0,,0,0,0,0.0,,,0,
1,ID,Boise,03/30/2011,03/23/2011,Air Filter,pCi/m3,0.0,0.0,0,,0,0,0,0.0,,,0,
2,AK,Juneau,03/30/2011,03/23/2011,Air Filter,pCi/m3,0.0,0.0,0,,0,0,0,0.0,,,0,
3,AK,Nome,03/30/2011,03/22/2011,Air Filter,pCi/m3,0.0,0.0,0,,0,0,0,0.0,,,0,
4,AK,Nome,03/30/2011,03/23/2011,Air Filter,pCi/m3,0.0,0.0,0,,0,0,0,0.0,,,0,


In [15]:
df['State'].head()

0    ID
1    ID
2    AK
3    AK
4    AK
Name: State, dtype: object

In [18]:
df[df.State == "MN"]

Unnamed: 0,State,Location,Date Posted,Date Collected,Sample Type,Unit,Ba-140,Co-60,Cs-134,Cs-136,Cs-137,I-131,I-132,I-133,Te-129,Te-129m,Te-132,Ba-140.1
367,MN,St. Paul,04-08-2011,03/28/2011,Drinking Water,pCi/l,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,0,
368,MN,St. Paul,04/22/2011,04/13/2011,Drinking Water,pCi/l,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,0,
380,MN,Welch,04-08-2011,03/29/2011,Drinking Water,pCi/l,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,0,
381,MN,Welch,06-01-2011,04/14/2011,Drinking Water,pCi/l,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,0,
555,MN,St. Paul,04-04-2011,03/22/2011,Precipitation,pCi/l,0.0,0.0,0,,0,32,0,0.0,,,0,
556,MN,St. Paul,04-10-2011,03/29/2011,Precipitation,pCi/l,0.0,0.0,0,0.0,0,16,0,0.0,0.0,0.0,0,
557,MN,Welch,04-04-2011,03/17/2011,Precipitation,pCi/l,0.0,0.0,0,,0,0,0,0.0,,,0,
558,MN,Welch/510,04/13/2011,04-04-2011,Precipitation,pCi/l,0.0,0.0,0,0.0,0,9,0,0.0,0.0,0.0,0,


In [19]:
df[(df.State == 'CA') & (df['Sample Type'] == 'Drinking Water')]

Unnamed: 0,State,Location,Date Posted,Date Collected,Sample Type,Unit,Ba-140,Co-60,Cs-134,Cs-136,Cs-137,I-131,I-132,I-133,Te-129,Te-129m,Te-132,Ba-140.1
305,CA,Los Angeles,04-10-2011,04-04-2011,Drinking Water,pCi/l,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,0,
306,CA,Los Angeles,06-01-2011,04-12-2011,Drinking Water,pCi/l,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,0,
356,CA,Richmond,04-09-2011,03/29/2011,Drinking Water,pCi/l,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,0,
357,CA,Richmond,06-01-2011,04/13/2011,Drinking Water,pCi/l,0.0,0.0,0,0.0,0,0,0,0.0,0.0,0.0,0,


In [20]:
df[(df.State == "MN") ]["I-131"]

367     0
368     0
380     0
381     0
555    32
556    16
557     0
558     9
Name: I-131, dtype: int64

In [21]:
df.loc[df.State == "MN", "I-131"]
df[['I-132']].head()

Unnamed: 0,I-132
0,0
1,0
2,0
3,0
4,0
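
Cells 20 and 21 select the same values in two ways: chained indexing (`df[mask]["I-131"]`) and a single `.loc` call. The sketch below, on a few invented rows with the same column names, checks that the two reads agree and shows why `.loc` is usually preferred when assigning: it addresses rows and column in one step on the original DataFrame.

```python
import pandas as pd

# Hypothetical rows using the column names from the exercise
df = pd.DataFrame(
    {
        "State": ["ID", "MN", "AK", "MN"],
        "I-131": [0, 32, 0, 9],
    }
)

# Two ways to read the I-131 values for Minnesota
chained = df[df.State == "MN"]["I-131"]       # filter rows, then pick a column
with_loc = df.loc[df.State == "MN", "I-131"]  # rows and column in one call

same = chained.equals(with_loc)
print(same)

# Assignment through .loc updates df itself
df.loc[df.State == "MN", "I-131"] = 0
print(df["I-131"].tolist())
```

Assigning through the chained form instead would act on an intermediate copy, so pandas may warn and leave `df` unchanged.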
