## Pandas Tutorial 15: Handling Large Datasets in Pandas
Datasets loaded into pandas are often large enough that you can run out of memory. This tutorial covers several memory optimization techniques in pandas.

#### Topics covered:
* **Introduction**
* **Loading only necessary data**
* **Optimizing data types**
* **Reading in chunks**
* **Memory profiling and tracking**

This tutorial will show you how to optimize the loading and processing of large datasets in pandas by utilizing selective loading, optimizing data types, and chunked reading, along with useful memory profiling techniques to ensure that you don't run out of memory.

In [1]:
import pandas as pd

## 1. Loading Only Necessary Data
To save memory, load only the columns and rows that you actually need. Suppose you only need the counts of registered voters, abstentions, voters, and expressed votes, plus the 'Sex' column. Use the `usecols` parameter to select those columns and `nrows` to limit how many rows are read.

In [2]:
# Load only specific columns and rows
df = pd.read_csv("voters.csv", usecols=['Registered', 'Abstentions','Voters', 'Expressed', 'Sex'], nrows=10000)
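
To see the effect of `usecols` and `nrows` without the real file, here is a minimal sketch that reads a tiny in-memory CSV. The values are made up for illustration and `csv_text` is a hypothetical stand-in for `voters.csv`.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for voters.csv (made-up values)
csv_text = """Registered,Abstentions,Voters,Expressed,Sex,Surname
100,20,80,75,M,A
200,50,150,140,F,B
300,60,240,230,M,C
"""

# Full load: every column and every row
full = pd.read_csv(io.StringIO(csv_text))

# Selective load: three columns, first two data rows only
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Registered", "Voters", "Sex"],
    nrows=2,
)

print(full.shape)            # (3, 6)
print(subset.shape)          # (2, 3)
print(list(subset.columns))  # ['Registered', 'Voters', 'Sex']
```

Columns that are never parsed never occupy memory, so this is usually the cheapest optimization to apply first.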

## 2. Optimizing Data Types
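By default, pandas infers each column's data type, typically choosing 64-bit integers and floats for numbers and the generic `object` type for strings. You can reduce memory usage by converting columns to more compact types. For instance, a string column with few distinct values, like 'Sex', can be converted to `category`, and integer columns like 'Registered' and 'Voters' can be downcast to the smallest integer type that holds their values.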

In [3]:
# Convert to category type
df['Sex'] = df['Sex'].astype('category')

# Downcast numerical columns
df['Registered'] = pd.to_numeric(df['Registered'], downcast='integer')
df['Voters'] = pd.to_numeric(df['Voters'], downcast='integer')
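
As a rough check that these conversions actually help, the sketch below applies the same two steps to a small synthetic frame and compares total memory. The frame `df_demo` and its values are assumptions for illustration, not data from `voters.csv`.

```python
import pandas as pd

# Synthetic sample (made-up values); real columns will differ
df_demo = pd.DataFrame({
    "Sex": ["M", "F", "F", "M"] * 250,
    "Registered": [120, 950, 30000, 70] * 250,
})

# Total bytes before conversion, counting string contents
before = df_demo.memory_usage(deep=True).sum()

# Same conversions as in the cell above
df_demo["Sex"] = df_demo["Sex"].astype("category")
df_demo["Registered"] = pd.to_numeric(df_demo["Registered"], downcast="integer")

# Total bytes after conversion
after = df_demo.memory_usage(deep=True).sum()

# 'Registered' becomes int16 here because its largest value (30000) fits
print(df_demo.dtypes)
print(f"before: {before} bytes, after: {after} bytes")
```

The savings depend on the data: `category` pays off when a column has few distinct values relative to its length, and `downcast` can only shrink a column as far as its largest and smallest values allow.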

## 3. Reading in Chunks
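If the dataset is too large to load at once, pass `chunksize` to `read_csv` to get an iterator of DataFrames, each holding at most that many rows. Only one chunk is in memory at a time, which keeps peak memory usage low and makes processing more manageable.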

In [5]:
# Read the file in chunks
chunk_iter = pd.read_csv("voters.csv", chunksize=100000)

for chunk in chunk_iter:
    # Process each chunk here (for example, summing up the 'Registered' column)
    print(chunk[['Registered']].sum())

Registered    61821838
dtype: int64
Registered    61730144
dtype: int64
Registered    63158272
dtype: int64
Registered    65002199
dtype: int64
Registered    68140196
dtype: int64
Registered    71800637
dtype: int64
Registered    74999579
dtype: int64
Registered    56836607
dtype: int64
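
The loop above prints one sum per chunk. To get a single result for the whole file, you can accumulate across chunks instead. The following is a minimal sketch on a small in-memory CSV; `csv_text` and its values are hypothetical stand-ins for `voters.csv`.

```python
import io
import pandas as pd

# Hypothetical 10-row CSV (made-up values) standing in for voters.csv
csv_text = "Registered,Voters\n" + "\n".join(f"{i},{i // 2}" for i in range(1, 11))

total_registered = 0

# Read 4 rows at a time and load only the column being aggregated
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["Registered"], chunksize=4):
    total_registered += int(chunk["Registered"].sum())

print(total_registered)  # 55 (the sum of 1 through 10)
```

This pattern works for any aggregation that can be combined from partial results, such as sums, counts, minimums, and maximums. A mean can be derived by tracking a running sum and a running count.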


## 4. Memory Profiling and Tracking
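It's helpful to measure memory usage before and after optimizations. The `memory_usage()` method reports the bytes used by each column; passing `deep=True` makes it count the actual contents of `object` (string) columns rather than only the references to them.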

In [16]:
# Load the dataset
df = pd.read_csv("voters.csv")

# Initial data types check
print("Initial data types:")
print(df.dtypes)

# Measure memory usage BEFORE any conversion, so the comparison is meaningful
print("Memory usage before optimization (bytes):")
print(df.memory_usage(deep=True))

# Focused optimizations: convert repetitive string columns to 'category'.
# These names are taken from the object columns in the dtype listing.
df['Department'] = df['Department'].astype('category')
df['Commune'] = df['Commune'].astype('category')
df['Polling station'] = df['Polling station'].astype('category')

# Measure memory usage AFTER the conversion
print("Memory usage after optimization (bytes):")
print(df.memory_usage(deep=True))

# Final data types after optimization
print("Final data types:")
print(df.dtypes)

Initial data types:
Unnamed: 0                   int64
Department code             object
Department                  object
Constituency code            int64
Constituency                object
Commune code                 int64
Commune                     object
Polling station             object
Registered                   int64
Abstentions                  int64
% Abs/Reg                  float64
Voters                       int64
% Vot/Reg                  float64
None of the above(NOTA)      int64
% NOTA/Reg                 float64
% NOTA/Vot                 float64
Nulls                        int64
% Nulls/Reg                float64
% Nulls/Vot                float64
Expressed                    int64
% Exp/Reg                  float64
% Exp/Vot                  float64
Signboard                    int64
Sex                         object
Surname                     object
First name                  object
Voted                        int64
% Votes/Reg                float64
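
Per-column byte counts can be hard to compare at a glance. One option is to wrap `memory_usage(deep=True)` in a small helper that returns a single total. The helper name `memory_mb` and the sample frame below are assumptions made for illustration; they are not part of pandas or of `voters.csv`.

```python
import pandas as pd


def memory_mb(frame: pd.DataFrame) -> float:
    """Return the frame's total memory footprint in MB, counting string contents."""
    return frame.memory_usage(deep=True).sum() / 1024 ** 2


# Synthetic sample (made-up values): a repetitive string column and an integer column
demo = pd.DataFrame({
    "Department": ["Ain", "Aisne", "Allier"] * 10_000,
    "Registered": range(30_000),
})

before_mb = memory_mb(demo)

# Convert the repetitive string column to 'category' on a copy
optimized = demo.assign(Department=demo["Department"].astype("category"))
after_mb = memory_mb(optimized)

print(f"before: {before_mb:.2f} MB, after: {after_mb:.2f} MB")
print(f"saved: {100 * (1 - after_mb / before_mb):.1f}%")
```

Taking the "before" measurement prior to any conversion is what makes the comparison meaningful; measuring twice after the conversion would report identical numbers.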


## Summary
This tutorial covered essential tips to handle large datasets in pandas without running out of memory:

1. **Load only necessary columns/rows** using `usecols` and `nrows`. 
2. **Optimize data types** by converting string columns to `category` and downcasting numerical columns.
3. **Process data in chunks** using `chunksize` so that only one chunk is in memory at a time.
4. **Monitor memory usage** before and after optimizations with `memory_usage`.