# Pandas and Big Data (Beyond Pandas)
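### 1. Introduction to Dask
Dask and Vaex are Python libraries designed to handle large datasets that don’t fit into memory. This section introduces Dask; Vaex appears in the case study in section 5.
- **Dask**: Splits a dataset into partitions and runs Pandas-style operations on them in parallel, so the full dataset never has to be in memory at once.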


In [1]:
# Installing Dask
# !pip install dask
import dask.dataframe as dd
import pandas as pd

In [2]:
# Load a large CSV file using Dask
dask_df = dd.read_csv('data/local/large_file.csv')

# Perform basic operations like computing the mean of a column
# (Dask builds the computation lazily; .compute() runs it and returns the result)
mean_value = dask_df['A'].mean().compute()
print("Mean value of A:", mean_value)

# Group by a column and calculate the sum
grouped_sum = dask_df.groupby('B').sum().compute()
print(grouped_sum.head())


Mean value of A: 0.0024709422254980775
                  A         C         D         E
B                                                
-4.234288 -0.322151 -0.801109  0.927263 -1.061999
-4.132184  0.212574  0.317399  0.433281  0.312023
-4.029719  0.460794  0.783723 -1.180558  0.966778
-4.013216  1.567691  0.346469 -0.272435  1.533439
-3.916728  0.786227  2.086162 -1.018216 -0.343104


### 2. Handling Big Data in Pandas
Pandas alone can process fairly large files if you keep its memory footprint under control, for example by declaring compact dtypes up front and reading the file in chunks rather than all at once.
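
One way to see what dtype optimization buys is to measure memory before and after a conversion. The sketch below is a minimal illustration on synthetic data; the column names `value` and `label` are invented for the example and do not come from the file used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one numeric column and one low-cardinality text column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.standard_normal(100_000),
    "label": rng.choice(["low", "medium", "high"], size=100_000),
})

before = df.memory_usage(deep=True).sum()

# Downcast floats to 32 bits and store the repeated strings as a categorical.
optimized = df.astype({"value": "float32", "label": "category"})
after = optimized.memory_usage(deep=True).sum()

print(f"before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")
```

Keep in mind that `float32` holds only about seven significant digits, so downcasting is a trade-off between memory and precision rather than a free win.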

In [3]:
# Sample CSV loading with memory optimization
# (column1..column3 are placeholder names; replace them with columns from your own file)
types_dict = {
    'column1': 'int32',
    'column2': 'float32',
    'column3': 'category'
}

# Load data in chunks and process each chunk
chunk_size = 10000
chunks = pd.read_csv('data/local/large_file.csv', dtype=types_dict, chunksize=chunk_size)

for chunk in chunks:
    # Process each chunk
    print(chunk.head())


          A         B         C         D         E
0 -0.722498  0.538341 -0.095738  1.398888  0.062597
1  0.770041 -0.102221  1.012833  0.002186 -0.727072
2 -1.999852  1.081321  0.544163 -0.058429 -0.349761
3 -0.112298 -0.921257 -1.053475 -0.882685 -1.336251
4 -0.845955  0.351747 -0.728274 -0.562573 -1.311634
              A         B         C         D         E
10000  0.752074 -0.802763  1.104975 -1.570371 -0.212752
10001 -1.702163  0.368405 -0.657176 -0.083670 -0.150433
10002  0.306042  0.965945  0.090297  1.357273 -0.145185
10003  0.114770 -1.058163  0.426878 -0.943990  1.148841
10004  1.125774  0.601294 -0.924858 -0.046938  0.630849
              A         B         C         D         E
20000 -0.676919  0.460245 -2.231921  0.553191  0.256725
20001  0.958073  1.094360  1.271988 -1.359855 -1.702002
20002 -0.030398  0.721667  0.017222  0.678820 -1.562772
20003  1.233221  0.507481  0.710395  1.186792  1.398495
20004 -2.054455 -0.220579 -0.219640 -1.205752 -0.214547
...
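
Printing each chunk shows the mechanics of chunked reading, but in practice each chunk is usually reduced to a small summary that is combined at the end. The following is a minimal sketch of that pattern, computing a column mean from a running sum and count. The in-memory CSV and the chunk size of 100 are assumptions made so the example runs without an external file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for a large CSV file, built in memory so the sketch runs anywhere.
rng = np.random.default_rng(1)
sample = pd.DataFrame(rng.standard_normal((1_000, 2)), columns=["A", "B"])
csv_text = sample.to_csv(index=False)

total = 0.0
count = 0
# Read 100 rows at a time and keep only the running sum and row count.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    total += chunk["A"].sum()
    count += chunk["A"].count()

chunked_mean = total / count
print("Mean of A computed chunk by chunk:", chunked_mean)
```

Only one chunk is held in memory at a time, so peak memory depends on the chunk size rather than on the size of the file.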

### 3. Scaling Pandas Operations on Large Datasets
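Dask scales Pandas-style operations to large datasets by splitting the work into partitions and running them in parallel across multiple cores. On a file small enough for Pandas to load comfortably, scheduling overhead can cancel out the gain, which is why the two timings below are nearly identical.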

In [4]:
import time

# Load a large dataset using Dask (lazy: the file is not read until compute() is called)
dask_large_df = dd.read_csv('data/local/large_file.csv')

# Measuring time for an operation with Pandas
start_time = time.time()
pandas_df = pd.read_csv('data/local/large_file.csv')
pandas_result = pandas_df.groupby('A').sum()
print(f"Pandas processing time: {time.time() - start_time} seconds")

# Measuring time for an operation with Dask
start_time = time.time()
dask_result = dask_large_df.groupby('A').sum().compute()
print(f"Dask processing time: {time.time() - start_time} seconds")

Pandas processing time: 0.1980609893798828 seconds
Dask processing time: 0.1929333209991455 seconds
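
To make the idea of partitioned computation concrete, the sketch below mimics it with plain Pandas and a thread pool: each partition is grouped on its own, and the partial results are combined with a second group-by. This is only a conceptual illustration and not how Dask is implemented; the column names `key` and `value`, the four partitions, and the helper `partial_sum` are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

# Hypothetical data: a small integer key column and a value column.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "key": rng.integers(0, 5, size=10_000),
    "value": rng.standard_normal(10_000),
})

# Split the rows into four partitions, as a partitioned DataFrame would.
bounds = np.linspace(0, len(df), num=5, dtype=int)
partitions = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]


def partial_sum(part: pd.DataFrame) -> pd.Series:
    """Group one partition and return its per-key sums."""
    return part.groupby("key")["value"].sum()


# Compute the partial results concurrently, then combine them with a second group-by.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, partitions))

combined = pd.concat(partials).groupby(level=0).sum()
print(combined)
```

The combine step works because a sum of partial sums equals the overall sum. Aggregations without that property, such as a median, need a different strategy.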


### 4. Combining Pandas with Apache Spark for Big Data
Apache Spark is a distributed data processing engine that scales to datasets spread across a cluster. Once Spark has filtered or limited the data to a size that fits in memory, the result can be converted to a Pandas DataFrame for further analysis.

In [5]:
# Importing PySpark (requires the pyspark package and a Java runtime)
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName('BigData').getOrCreate()

# Load CSV into a Spark DataFrame
spark_df = spark.read.csv('data/local/large_file.csv', header=True, inferSchema=True)

# Show the schema and first few rows
spark_df.printSchema()
spark_df.show(5)

# Convert Spark DataFrame to Pandas DataFrame
# (toPandas() collects every row onto the driver, so limit the size first)
pandas_df_from_spark = spark_df.limit(1000).toPandas()

# Performing some Pandas operations on the smaller dataset
print(pandas_df_from_spark.describe())

root
 |-- A: double (nullable = true)
 |-- B: double (nullable = true)
 |-- C: double (nullable = true)
 |-- D: double (nullable = true)
 |-- E: double (nullable = true)

+--------------------+--------------------+--------------------+--------------------+-------------------+
|                   A|                   B|                   C|                   D|                  E|
+--------------------+--------------------+--------------------+--------------------+-------------------+
| -0.7224981161400361|  0.5383413302411633|-0.09573776363361086|  1.3988884728158737| 0.0625965156347914|
|  0.7700405276674138|-0.10222058901775122|  1.0128334444560585|0.002185919898783305|-0.7270721625710608|
|  -1.999851512549912|  1.0813214913952787|  0.5441629289221556|-0.05842924126490761|-0.3497607171684442|
|-0.11229824892776323| -0.9212574430191187| -1.0534751072428974| -0.8826854237694995|-1.3362511503775372|
| -0.8459546251440426|  0.3517471630058665| -0.7282737309144859| -0.5625731127756347|-1

### 5. Case Studies on Big Data Handling with Pandas
In this section, we'll explore how to handle a large dataset in a real-world scenario by combining Pandas, Dask, and Vaex.

### Case Study: Log Data Analysis
#### Objective
Analyze server log data to identify usage patterns and errors, and to optimize server performance.

#### Steps:
1. **Load Large Dataset Efficiently**: Use Dask to load and preprocess large log data (e.g., in CSV or Parquet format).
2. **Data Cleansing**: Handle missing values and filter logs based on criteria like status code, timestamp, or user ID.
3. **Aggregation and Summarization**: Aggregate data to find frequent paths, error occurrences, and peak times.
4. **Optimization**: Use Vaex to perform fast, memory-efficient operations on the cleaned data.
5. **Reporting**: Sample a subset of the data into Pandas for visualization and final reporting.
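
The cleansing and aggregation steps above can be sketched on a tiny in-memory sample. The example uses plain Pandas rather than Dask or Vaex so that it runs without extra dependencies. The log schema (`timestamp`, `path`, `status`, `user_id`) and every value in it are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical log extract; the column names and values are invented for illustration.
raw = io.StringIO(
    "timestamp,path,status,user_id\n"
    "2024-01-01 10:05:00,/home,200,u1\n"
    "2024-01-01 10:17:00,/login,500,u2\n"
    "2024-01-01 11:02:00,/home,200,\n"
    "2024-01-01 11:30:00,/search,404,u3\n"
    "2024-01-01 11:45:00,/home,200,u1\n"
    "2024-01-01 11:50:00,/login,500,u4\n"
)

# Step 1: load (a real pipeline would use a chunked or out-of-core reader here).
logs = pd.read_csv(raw, parse_dates=["timestamp"])

# Step 2: cleanse by dropping rows with no user ID.
logs = logs.dropna(subset=["user_id"])

# Step 3: aggregate into frequent paths, errors per path, and requests per hour.
top_paths = logs["path"].value_counts()
errors = logs[logs["status"] >= 400].groupby("path").size()
requests_per_hour = logs.groupby(logs["timestamp"].dt.hour).size()
peak_hour = requests_per_hour.idxmax()

print(top_paths.head(3))
print(errors)
print("Peak hour:", peak_hour)
```

With real log volumes, the same filters and group-bys would be expressed against a partitioned or out-of-core DataFrame, and only the small aggregated results would be brought into Pandas for reporting.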

#### Tools & Libraries:
- **Dask** for data loading and processing.
- **Pandas** for detailed analysis and visualization.
- **Vaex** for efficient data manipulation and transformation.

By combining these tools, you can load, clean, and analyze datasets that wouldn't fit in memory if processed with Pandas alone, and still use Pandas for the final reporting step.
