


##  Reading Data from Various Sources

Data can come from many places: files, databases, APIs, or distributed systems. The table below maps each source type to the Python tools commonly used to read it.

###  Common Sources & Tools

| Source Type     | Tool/Library        | Method/Function |
|-----------------|---------------------|------------------|
| **CSV/Excel**    | `pandas`, `openpyxl` | `pd.read_csv()`, `pd.read_excel()` |
| **SQL Databases**| `SQLAlchemy`, `sqlite3`, `pandas` | `pd.read_sql_query()` |
| **JSON/XML**     | `json`, `xml.etree`, `pandas` | `pd.read_json()`, `ElementTree` |
| **Web APIs**     | `requests`, `urllib` | `requests.get()`, `json.loads()` |
| **Big Data**     | `PySpark`, `Dask`    | `spark.read.csv()`, `dask.dataframe.read_csv()` |
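
As a rough illustration of the pandas readers in the table, the sketch below loads the same small dataset from CSV, SQL, and JSON. The data, the `scores` table name, and the in-memory stand-ins (`io.StringIO`, an in-memory SQLite database) are assumptions made so the example runs without external files; in practice you would pass a file path or a real database connection.

```python
import io
import sqlite3

import pandas as pd

# CSV: an in-memory string stands in for a file path such as "data.csv"
csv_text = "name,score\nA,85\nB,90\n"
csv_df = pd.read_csv(io.StringIO(csv_text))

# SQL: load the same rows into an in-memory SQLite table (hypothetical
# table name "scores"), then query it back with pd.read_sql_query()
conn = sqlite3.connect(":memory:")
csv_df.to_sql("scores", conn, index=False)
sql_df = pd.read_sql_query("SELECT name, score FROM scores WHERE score > 85", conn)
conn.close()

# JSON: a list of records, again wrapped in StringIO instead of a file
json_text = '[{"name": "A", "score": 85}, {"name": "B", "score": 90}]'
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df)
print(sql_df)
print(json_df)
```

All three readers return a pandas DataFrame, so downstream code does not need to care where the data came from.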

---

##  Arrays, DataFrames, Vectors, Series

Understanding these structures helps you manipulate and analyze data efficiently.

###  1. Arrays (NumPy)
- **Definition**: Homogeneous, multi-dimensional data structure
- **Library**: `numpy`
- **Use Case**: Mathematical operations, matrix manipulation
```python
import numpy as np
arr = np.array([1, 2, 3])
```
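
To make the "homogeneous" and "matrix manipulation" points concrete, here is a minimal sketch of element-wise arithmetic, dtype coercion, and matrix multiplication. The variable names are illustrative only.

```python
import numpy as np

arr = np.array([1, 2, 3])

# Arithmetic is applied element-wise, without an explicit loop
doubled = arr * 2                    # array([2, 4, 6])

# Homogeneous: mixing ints and floats coerces every element to one dtype
mixed = np.array([1, 2.5])           # dtype is float64

# Two-dimensional arrays support matrix operations
matrix = np.array([[1, 2], [3, 4]])
product = matrix @ matrix            # matrix multiplication
transposed = matrix.T                # rows and columns swapped

print(doubled, mixed.dtype)
print(product)
print(transposed)
```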

###  2. Series (pandas)
- **Definition**: One-dimensional labeled array
- **Library**: `pandas`
- **Use Case**: Time series, single column operations
```python
import pandas as pd
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
```
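
The labels are what distinguish a Series from a plain array. The short sketch below, reusing the same values, shows access by label, access by position, and boolean filtering.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = s['b']          # look up by index label
by_position = s.iloc[0]    # look up by integer position
filtered = s[s > 15]       # boolean mask keeps the labels of matching rows
total = s.sum()            # aggregate over all values

print(by_label, by_position, total)
print(filtered)
```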

###  3. DataFrame (pandas)
- **Definition**: Two-dimensional labeled data structure
- **Library**: `pandas`
- **Use Case**: Tabular data analysis, filtering, grouping
```python
df = pd.DataFrame({'Name': ['A', 'B'], 'Score': [85, 90]})
```
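
The filtering and grouping use cases can be sketched as follows. The `Group` column and the third row are hypothetical additions to the example above, included only so that grouping has something to aggregate.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Group': ['x', 'y', 'x'],   # hypothetical grouping column
    'Score': [85, 90, 70],
})

# Filtering: keep rows where the condition holds
high = df[df['Score'] >= 85]

# Grouping: mean score per group, returned as a Series indexed by Group
mean_by_group = df.groupby('Group')['Score'].mean()

print(high)
print(mean_by_group)
```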

###  4. Vectors (MLlib in PySpark)
- **Definition**: Dense or sparse vector used in machine learning
- **Library**: `pyspark.ml.linalg`
- **Use Case**: Feature representation for ML models
```python
from pyspark.ml.linalg import Vectors
v = Vectors.dense([1.0, 0.0, 3.0])
```

---

##  Introduction to PySpark

PySpark is the Python API for Apache Spark, a distributed computing engine for big data processing.

###  Why PySpark?
- Handles **large-scale data** across clusters
- Supports **SQL**, **streaming**, **ML**, and **graph processing**
- Integrates with **Hadoop**, **Hive**, **Kafka**, and **Delta Lake**

###  Core Concepts

| Concept        | Description |
|----------------|-------------|
| **SparkSession** | Entry point to Spark functionality |
| **RDD**          | Low-level resilient distributed dataset |
| **DataFrame**    | High-level abstraction for structured data |
| **Transformations** | Lazy operations that only build an execution plan (e.g., `filter`, `select`; `map` on RDDs) |
| **Actions**       | Trigger execution of the plan and return or display results (e.g., `collect`, `show`) |

###  Basic PySpark Workflow
```python
from pyspark.sql import SparkSession

# Start Spark
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Read CSV
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Show data
df.show()

# Filter rows first, then select the column to display
df.filter(df["column2"] > 50).select("column1").show()
```

---

##  Best Practices

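- Use **pandas** for datasets that fit comfortably in a single machine's memory
- Switch to **PySpark** when the data is too large for one machine or already lives in a distributed system
- Convert between the two with `df.toPandas()` (Spark to pandas; this collects all rows onto the driver, so use it only on small results) and `spark.createDataFrame(pandas_df)` (pandas to Spark)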
- Validate schema and types before transformations
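
One possible way to validate schema and types in pandas is sketched below. The `EXPECTED` mapping and the `schema_problems` helper are hypothetical names introduced for illustration, not part of pandas; the dtype checks themselves come from `pandas.api.types`.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical expected schema: column name -> dtype check function
EXPECTED = {
    "Name": ptypes.is_string_dtype,
    "Score": ptypes.is_numeric_dtype,
}

def schema_problems(df, expected=EXPECTED):
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    for column, check in expected.items():
        if column not in df.columns:
            problems.append(f"missing column: {column}")
        elif not check(df[column]):
            problems.append(f"unexpected dtype for {column}: {df[column].dtype}")
    return problems

good = pd.DataFrame({"Name": ["A", "B"], "Score": [85, 90]})
bad = pd.DataFrame({"Name": ["A", "B"], "Score": ["85", "ninety"]})

print(schema_problems(good))
print(schema_problems(bad))
```

Running a check like this before a pipeline starts makes type errors surface early instead of in the middle of a transformation.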




