## Working With Large Datasets in Pandas

Can you work with large datasets in Python without PySpark or similar tools? Yes, pandas can handle them too, provided you apply a few techniques to keep memory usage under control.

In [1]:
import pandas as pd

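First, check how large each dataset is. `wc` prints the line, word, and byte counts per file; the pipeline below drops the final "total" row and converts bytes to MB: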

In [2]:
!wc ./datasets/* \
| tr -s ' ' \
| head -n -1 \
| awk '{ printf "Filepath: %s\nLines: %i\tSize in MB: %.2f\n", $4, $1, $3/1024^2 }'

Filepath: ./datasets/car_evaluation.csv
Lines: 1728	Size in MB: 0.05
Filepath: ./datasets/titanic.csv
Lines: 892	Size in MB: 0.06
Filepath: ./datasets/used_car_prices.csv
Lines: 4501	Size in MB: 0.28
Filepath: ./datasets/youtube_trends_us.csv
Lines: 40950	Size in MB: 59.85


### Technique No. 1: Only load the required columns

For example, sentiment analysis on video titles needs neither the video ID nor the description, so there is no reason to load those columns.

In [3]:
df = pd.read_csv('./datasets/youtube_trends_us.csv', usecols=['title', 'tags', 'views', 'likes'])
df.head()

| | title | tags | views | likes |
|---|---|---|---|---|
| 0 | WE WANT TO TALK ABOUT OUR MARRIAGE | SHANtell martin | 748374 | 57527 |
| 1 | The Trump Presidency: Last Week Tonight with J... | last week tonight trump presidency\|"last week ... | 2418783 | 97185 |
| 2 | Racist Superman \| Rudy Mancuso, King Bach & Le... | racist superman\|"rudy"\|"mancuso"\|"king"\|"bach"... | 3191434 | 146033 |
| 3 | Nickelback Lyrics: Real or Fake? | rhett and link\|"gmm"\|"good mythical morning"\|"... | 343168 | 10172 |
| 4 | I Dare You: GOING BALD!? | ryan\|"higa"\|"higatv"\|"nigahiga"\|"i dare you"\|"... | 2095731 | 132235 |


In [4]:
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40949 entries, 0 to 40948
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   40949 non-null  object
 1   tags    40949 non-null  object
 2   views   40949 non-null  int64 
 3   likes   40949 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 18.8 MB


The size of each column, in bytes (`deep=True` includes the memory held by the string values themselves):

In [5]:
df.memory_usage(deep=True)

Index         128
title     4628488
tags     14452593
views      327592
likes      327592
dtype: int64
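
To make the saving from `usecols` concrete, here is a minimal sketch that loads the same CSV twice and compares the memory footprint. The in-memory `csv_text` is a made-up stand-in for a wide file (it is not the YouTube dataset), and the variable names are illustrative only.

```python
import io

import pandas as pd

# Hypothetical stand-in for a wide CSV file; the column names only mimic the dataset above.
csv_text = (
    "video_id,title,description,views\n"
    "a1,First video,A long description that is never analysed,100\n"
    "b2,Second video,Another long description that is never analysed,250\n"
)

# Load every column, then only the two columns the analysis needs.
full = pd.read_csv(io.StringIO(csv_text))
slim = pd.read_csv(io.StringIO(csv_text), usecols=["title", "views"])

# deep=True also counts the memory held by the string values themselves.
full_bytes = int(full.memory_usage(deep=True).sum())
slim_bytes = int(slim.memory_usage(deep=True).sum())
print(f"all columns: {full_bytes} bytes, selected columns: {slim_bytes} bytes")
```

`usecols` also accepts a callable that receives each column name and returns whether to keep it, which can be handier when it is easier to name the columns to drop than the ones to keep.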

### Technique No. 2: Chunking

In [8]:
chunks = pd.read_csv('./datasets/youtube_trends_us.csv', chunksize=5000)
type(chunks)

pandas.io.parsers.readers.TextFileReader

In [7]:
for chunk in chunks:
    display(chunk['title'].str.title())

0                      We Want To Talk About Our Marriage
1       The Trump Presidency: Last Week Tonight With J...
2       Racist Superman | Rudy Mancuso, King Bach & Le...
3                        Nickelback Lyrics: Real Or Fake?
4                                I Dare You: Going Bald!?
                              ...                        
4995    Youtubers React To Their Old Youtube Channel P...
4996              Chris Stapleton - A Simple Song (Audio)
4997    Marvel Studios' Avengers: Infinity War Officia...
4998    Prince Harry And Meghan Markle Get The Alison ...
4999    Rolled Ice Cream Diy How To Make Rolled Ice Cr...
Name: title, Length: 5000, dtype: object

5000             Migos, Nicki Minaj, Cardi B - Motorsport
5001    Jurassic World: Fallen Kingdom - Official Trai...
5002                     Dude Vs. Wild - Nevada Mountains
5003    Alita: Battle Angel | Official Trailer [Hd] | ...
5004    The Game Awards - Full Show With Death Strandi...
                              ...                        
9995                                         2016 Vs 2017
9996            Dude Perfect Face Off | What'S In The Box
9997                                Mst3K (2017) Yule Log
9998    Titanic Transformation! The Rose Look Feat. Gu...
9999      Making New Sounds Using Artificial Intelligence
Name: title, Length: 5000, dtype: object

10000                                            So Sorry.
10001     Justin Timberlake - Introducing Man Of The Woods
10002        Developer Update | Happy New Year | Overwatch
10003    Fifty Shades Freed - Mrs. Grey Will See You No...
10004    Børns, Lana Del Rey - God Save Our Young Blood...
                               ...                        
14995         What If You Never Ate Fruits And Vegetables?
14996               Entire Breaking Bad Series In 1 Minute
14997                          High Fidelity Mixtape Rules
14998      We Share Our Home With 14 Bears | Beast Buddies
14999    Bruno Mars And Cardi B - Finesse (Live From Th...
Name: title, Length: 5000, dtype: object

15000    Dj Khaled, Rihanna - Wild Thoughts (2018 Live ...
15001    Hillary Clinton, Cardi B & More Audition For F...
15002      Childish Gambino Grammy Awards Performance 2018
15003     Unsane | Official Trailer | In Theaters March 23
15004    Bruno Mars Wins Record Of The Year | Acceptanc...
                               ...                        
19995    Double Rainbow Unicorn Apple Pie | How To Cook...
19996                               The First Time We Met!
19997                   Bishop Briggs - White Flag (Audio)
19998        Troye Sivan & Ariana Grande Working Together!
19999    Jimmy Fallon Does Special Five-Minute Homemade...
Name: title, Length: 5000, dtype: object

20000    Stephen A. Shares Theory On Why Spurs' Kawhi L...
20001                 Imagine Dragons - Next To Me (Audio)
20002                                                 Jack
20003      Nashville On Cmt | Final Episodes Coming June 7
20004    Snoop Dogg - One More Day (Audio) Ft. Charlie ...
                               ...                        
24995                                             Spinners
24996              What Spring Looks Like Around The World
24997                              Sing Anything Challenge
24998                                        Top Breeder 🐕
24999            Charades With Aaron Paul And Karlie Kloss
Name: title, Length: 5000, dtype: object

25000                     10$ Drum - Faded ( Alan Walker )
25001    Crushing And Slicing Red Hot Steel With Hydrau...
25002    The Infinity War Trailer But I Just Name Chara...
25003                       Keyboard Cat, Bento, A Tribute
25004    Adam Scott Hijacks A Stranger'S Tinder | Vanit...
                               ...                        
29995    Teens React To Mirror-Polished Japanese Foil B...
29996                   Festival Rainbow Makeup & Lookbook
29997    24 Hours With Camila Cabello: Inside Her First...
29998                        Honest Trailers - Baby Driver
29999                                       James Bay - Us
Name: title, Length: 5000, dtype: object

30000                                        Insta-Justice
30001      Bodyarmor Sports Drink | James Harden Thanks...
30002    Marvel Studios' Avengers: Infinity War | 10-Ye...
30003                         How The Squid Lost Its Shell
30004                     Huge Problem At The New House...
                               ...                        
34995       We Tried To Re-Create This Giant Cinnamon Roll
34996                      Bangabandhu Satellite-1 Mission
34997    Kylie Cosmetics X Kris Jenner Collection | Swa...
34998              Wearing Fashion Nova Outfits For A Week
34999    Rita Ora - Girls Ft. Cardi B, Bebe Rexha & Cha...
Name: title, Length: 5000, dtype: object

35000         Donald Glover On This Is America Music Video
35001          I Got My Apartment Professionally Organized
35002                             Getting Some Air, Atlas?
35003                  Charlie Puth - Boy [Official Audio]
35004                   Christina Aguilera - Twice (Audio)
                               ...                        
39995    Bumblebee (2018) - Official Teaser Trailer - P...
39996    Ralph Breaks The Internet: Wreck-It Ralph 2 Of...
39997                Old School Trick Shots | Dude Perfect
39998     Ethan Hawke Knows To Seek Knowledge From Masters
39999    Sabrina Carpenter - Almost Love (Official Lyri...
Name: title, Length: 5000, dtype: object

40000         Live Pd: Can I Text My Mom? (Season 2) | A&E
40001                           Blind Ice Cream Taste Test
40002              Iphone X — Animoji: Taxi Driver — Apple
40003           Suspiria - Teaser Trailer | Amazon Studios
40004    Game Theory: Fnaf Stumped Me! (Fnaf 6 Ultimate...
                               ...                        
40944                         The Cat Who Caught The Laser
40945                           True Facts : Ant Mutualism
40946    I Gave Safiya Nygaard A Perfect Hair Makeover ...
40947                  How Black Panther Should Have Ended
40948    Official Call Of Duty®: Black Ops 4 — Multipla...
Name: title, Length: 949, dtype: object

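In [9]:
# The reader is a one-shot iterator: re-create it, because the loop above already consumed it
chunks = pd.read_csv('./datasets/youtube_trends_us.csv', chunksize=5000)
results = map(lambda x: x['title'].str.title(), chunks)
list(results)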

[0                      We Want To Talk About Our Marriage
 1       The Trump Presidency: Last Week Tonight With J...
 2       Racist Superman | Rudy Mancuso, King Bach & Le...
 3                        Nickelback Lyrics: Real Or Fake?
 4                                I Dare You: Going Bald!?
                               ...                        
 4995    Youtubers React To Their Old Youtube Channel P...
 4996              Chris Stapleton - A Simple Song (Audio)
 4997    Marvel Studios' Avengers: Infinity War Officia...
 4998    Prince Harry And Meghan Markle Get The Alison ...
 4999    Rolled Ice Cream Diy How To Make Rolled Ice Cr...
 Name: title, Length: 5000, dtype: object,
 5000             Migos, Nicki Minaj, Cardi B - Motorsport
 5001    Jurassic World: Fallen Kingdom - Official Trai...
 5002                     Dude Vs. Wild - Nevada Mountains
 5003    Alita: Battle Angel | Official Trailer [Hd] | ...
 5004    The Game Awards - Full Show With Death Strandi...
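
Collecting every transformed chunk in a list, as above, ends up holding the whole column in memory again. Chunking pays off most when each chunk is reduced to something small before the next one is read. The following is a minimal sketch of that pattern, assuming a per-category sum of views is the goal; the tiny in-memory `csv_text` and the variable names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical miniature of the trends file: one row per video.
csv_text = "category_id,views\n1,100\n2,50\n1,25\n3,10\n2,5\n"

# chunksize=2 yields DataFrames of at most two rows each.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)

partials = []
for chunk in reader:
    # Reduce each chunk to a small per-category sum before keeping it.
    partials.append(chunk.groupby("category_id")["views"].sum())

# A category may appear in several chunks, so the partial sums are summed again.
views_per_category = pd.concat(partials).groupby(level=0).sum()
print(views_per_category)
```

Only the small partial results stay in memory, never more than one chunk of raw rows at a time. Note that the reader can be iterated only once; a second pass needs a fresh `pd.read_csv(..., chunksize=...)` call.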
             

### Technique No. 3: Make numeric types as small as possible

**NOTE**: This technique works best in combination with the two techniques above.

In [10]:
df = pd.read_csv('./datasets/youtube_trends_us.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40949 entries, 0 to 40948
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   video_id                40949 non-null  object
 1   trending_date           40949 non-null  object
 2   title                   40949 non-null  object
 3   channel_title           40949 non-null  object
 4   category_id             40949 non-null  int64 
 5   publish_time            40949 non-null  object
 6   tags                    40949 non-null  object
 7   views                   40949 non-null  int64 
 8   likes                   40949 non-null  int64 
 9   dislikes                40949 non-null  int64 
 10  comment_count           40949 non-null  int64 
 11  thumbnail_link          40949 non-null  object
 12  comments_disabled       40949 non-null  bool  
 13  ratings_disabled        40949 non-null  bool  
 14  video_error_or_removed  40949 non-null  bool  
 15  de

In [11]:
df.select_dtypes(include=['number']).agg(['min', 'max'])

| | category_id | views | likes | dislikes | comment_count |
|---|---|---|---|---|---|
| min | 1 | 549 | 0 | 0 | 0 |
| max | 43 | 225211923 | 5613827 | 1674420 | 1361580 |
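
The min/max values above determine how narrow each integer column can be. As a rough sketch, they can be compared against the limits reported by `np.iinfo`. The helper `smallest_int_dtype` below is a hypothetical name introduced for illustration (it is not part of pandas or NumPy), and it only considers signed types, to match the `int8` used in this notebook.

```python
import numpy as np
import pandas as pd

# (min, max) per column, copied from the table above.
ranges = {
    "category_id": (1, 43),
    "views": (549, 225211923),
    "likes": (0, 5613827),
    "dislikes": (0, 1674420),
    "comment_count": (0, 1361580),
}


def smallest_int_dtype(lo, hi):
    """Return the name of the narrowest signed NumPy integer type holding [lo, hi]."""
    for int_type in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(int_type)
        if info.min <= lo and hi <= info.max:
            return np.dtype(int_type).name
    raise ValueError("range does not fit in a 64-bit signed integer")


suggested = {col: smallest_int_dtype(lo, hi) for col, (lo, hi) in ranges.items()}
print(suggested)

# pandas can make the same choice from the data itself.
views = pd.Series([549, 225211923])
downcast_dtype = str(pd.to_numeric(views, downcast="integer").dtype)
print(downcast_dtype)
```

A narrow type is only safe while the data stays within its range. Values outside it can overflow, so it is worth leaving some headroom if the column is expected to grow.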


In [12]:
df = df.astype({'category_id': 'int8'})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40949 entries, 0 to 40948
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   video_id                40949 non-null  object
 1   trending_date           40949 non-null  object
 2   title                   40949 non-null  object
 3   channel_title           40949 non-null  object
 4   category_id             40949 non-null  int8  
 5   publish_time            40949 non-null  object
 6   tags                    40949 non-null  object
 7   views                   40949 non-null  int64 
 8   likes                   40949 non-null  int64 
 9   dislikes                40949 non-null  int64 
 10  comment_count           40949 non-null  int64 
 11  thumbnail_link          40949 non-null  object
 12  comments_disabled       40949 non-null  bool  
 13  ratings_disabled        40949 non-null  bool  
 14  video_error_or_removed  40949 non-null  bool  
 15  de

In [13]:
df = pd.read_csv('./datasets/youtube_trends_us.csv', dtype={'category_id': 'int8'})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40949 entries, 0 to 40948
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   video_id                40949 non-null  object
 1   trending_date           40949 non-null  object
 2   title                   40949 non-null  object
 3   channel_title           40949 non-null  object
 4   category_id             40949 non-null  int8  
 5   publish_time            40949 non-null  object
 6   tags                    40949 non-null  object
 7   views                   40949 non-null  int64 
 8   likes                   40949 non-null  int64 
 9   dislikes                40949 non-null  int64 
 10  comment_count           40949 non-null  int64 
 11  thumbnail_link          40949 non-null  object
 12  comments_disabled       40949 non-null  bool  
 13  ratings_disabled        40949 non-null  bool  
 14  video_error_or_removed  40949 non-null  bool  
 15  de
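
The three techniques can be combined in a single `read_csv` call. The sketch below assumes a small made-up CSV (`csv_text`) in place of the real file, and `int32` for `views` is an assumption based on the value range seen earlier, not something the notebook itself sets.

```python
import io

import pandas as pd

# Hypothetical miniature of the trends file.
csv_text = (
    "video_id,title,category_id,views\n"
    "a1,first,10,100\n"
    "b2,second,24,2000\n"
    "c3,third,10,30\n"
)

reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["category_id", "views"],                 # Technique 1: only the needed columns
    chunksize=2,                                      # Technique 2: read in pieces
    dtype={"category_id": "int8", "views": "int32"},  # Technique 3: narrow numeric types
)

total_views = 0
seen_dtypes = set()
for chunk in reader:
    # Each chunk arrives with only the selected columns and the narrow dtypes.
    seen_dtypes.add(str(chunk["category_id"].dtype))
    total_views += int(chunk["views"].sum())

print(total_views, seen_dtypes)
```

Passing `dtype` at read time avoids first materializing the columns as `int64` and converting them afterwards with `astype`.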