<div style="color:#006666; padding:0px 10px; border-radius:5px; font-size:18px;"><h1 style='margin:10px 5px'>Handling Large Data</h1>
</div>

© Copyright Machine Learning Plus

 <div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>1. Sampling and Subsetting on Load</h2>
</div>

__When to use__

You have a large dataset and need only a random sample of its rows, and sometimes only selected columns.

In that case, you can avoid loading the entire file and sampling afterwards by passing `skiprows` to `read_csv` to tell it which rows to skip.

`skiprows` accepts a function that receives the row number and returns `True` for rows to skip, so you can use it to randomize which rows are read.

__Why__

These techniques reduce the memory required to hold the data, allowing you to accommodate larger datasets in limited memory.
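The `skiprows` callable described above can be sketched on a small in-memory CSV. The data, the seed `42`, and the helper name `skip_row` with its `keep_fraction` parameter are assumptions made for illustration; seeding the generator is one way to get the same sample on every run.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: a 1000-row CSV built in memory so the sketch runs anywhere.
csv_text = pd.DataFrame({"id": range(1000), "value": [i * 2 for i in range(1000)]}).to_csv(index=False)

rng = np.random.default_rng(42)  # seeded, so the same rows are kept on every run

def skip_row(row_number, keep_fraction=0.1):
    """Return True to skip a row; row 0 (the header) is always kept."""
    return row_number != 0 and rng.random() > keep_fraction

sample = pd.read_csv(io.StringIO(csv_text), skiprows=skip_row)
print(sample.shape)
```

Because each row is kept independently with probability `keep_fraction`, the sample size varies around the target rather than matching it exactly.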

In [None]:
import numpy as np
import pandas as pd

In [None]:
# read only a random ~1% of the rows (the header row, x == 0, is always kept).
df = pd.read_csv('Datasets/large_dataset.csv', 
                 skiprows=lambda x: x != 0 and np.random.rand() > 0.01)
df.head()

Unnamed: 0,HasTpm,Census_OSInstallLanguageIdentifier,LocaleEnglishNameIdentifier,EngineVersion,UacLuaenable,Census_MDC2FormFactor,Census_IsSecureBootEnabled,Census_OSVersion,Census_GenuineStateName,Census_InternalPrimaryDisplayResolutionHorizontal,RtpStateBitfield,Census_ActivationChannel,GeoNameIdentifier,Census_OSUILocaleIdentifier,Census_OSEdition,AVProductsInstalled,Census_FirmwareManufacturerIdentifier,Census_OEMNameIdentifier,Census_SystemVolumeTotalCapacity,Census_OSVersion_0
0,1,0.3562,0.2347,0.4119,0,1,0,0.03644,0,1366.0,0,0,0.1718,0.3555,0,1.0,0.10547,0.1063,37974.0,0
1,1,0.116,0.03827,0.431,0,1,0,0.02428,0,1366.0,0,0,0.03885,0.1167,1,2.0,0.0433,0.1038,694454.0,0
2,1,0.014786,0.01521,0.4119,0,0,0,0.02428,0,1920.0,0,0,0.01536,0.014824,0,1.0,0.3025,0.03476,113971.0,0
3,1,0.04846,0.2347,0.02391,0,5,1,0.1009,0,1366.0,0,2,0.000353,0.04892,1,1.0,0.1317,0.1164,937177.0,0
4,1,0.05734,0.04608,0.4119,0,1,1,0.000676,0,1366.0,0,2,0.03323,0.0575,0,1.0,0.01197,0.1164,934026.0,0


In [None]:
df.shape

(9895, 20)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9895 entries, 0 to 9894
Data columns (total 20 columns):
 #   Column                                             Non-Null Count  Dtype  
---  ------                                             --------------  -----  
 0   HasTpm                                             9895 non-null   int64  
 1   Census_OSInstallLanguageIdentifier                 9813 non-null   float64
 2   LocaleEnglishNameIdentifier                        9895 non-null   float64
 3   EngineVersion                                      9895 non-null   float64
 4   UacLuaenable                                       9895 non-null   int64  
 5   Census_MDC2FormFactor                              9895 non-null   int64  
 6   Census_IsSecureBootEnabled                         9895 non-null   int64  
 7   Census_OSVersion                                   9895 non-null   float64
 8   Census_GenuineStateName                            9895 non-null   int64  
 9   Census_I

Additionally, you can specify only the columns you want to read, so the other columns are never loaded, which saves memory.

In [None]:
%%time

# specify the columns: 
df = pd.read_csv('Datasets/large_dataset.csv', 
                 skiprows=lambda x: x != 0 and np.random.rand() > 0.01,
                 usecols=['HasTpm', 'GeoNameIdentifier', 'RtpStateBitfield'])
df.head()

Wall time: 1.68 s


Unnamed: 0,HasTpm,RtpStateBitfield,GeoNameIdentifier
0,1,0,0.00949
1,0,0,0.01823
2,1,0,0.03867
3,1,0,0.01823
4,1,0,0.1718


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10028 entries, 0 to 10027
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   HasTpm             10028 non-null  int64  
 1   RtpStateBitfield   10028 non-null  int64  
 2   GeoNameIdentifier  10026 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 235.2 KB


__Specify the datatypes as well to bring down the memory consumption further.__

First, let's check what sort of datatypes are required to hold the data.

In [None]:
df.describe()
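As a rough sketch of how to choose smaller datatypes, you can compare the value ranges against the limits reported by `np.iinfo` and `np.finfo`, or let `pd.to_numeric` downcast for you. The dataframe `demo` below is made-up data, not the course dataset. Note that `downcast="float"` stops at `float32`, so a `float16` column has to be requested explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the real columns.
demo = pd.DataFrame({
    "flag": [0, 1, 1, 0],
    "count": [3, 120, 45, 7],
    "ratio": [0.25, 0.5, 0.125, 1.0],
})

# Range limits of the candidate dtypes.
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)
print(np.finfo(np.float16).max)

small = demo.copy()
small["flag"] = small["flag"].astype("bool")                         # only 0/1 values
small["count"] = pd.to_numeric(small["count"], downcast="integer")   # smallest int that fits
small["ratio"] = pd.to_numeric(small["ratio"], downcast="float")     # float32 at the smallest

print(small.dtypes)
print(demo.memory_usage(deep=True).sum(), "->", small.memory_usage(deep=True).sum())
```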

In [None]:
# specify the columns: 
df = pd.read_csv('Datasets/large_dataset.csv', 
                 skiprows=lambda x: x != 0 and np.random.rand() > 0.01,
                 usecols=['HasTpm', 'GeoNameIdentifier', 'RtpStateBitfield'],
                 dtype={'HasTpm':'bool', 'GeoNameIdentifier':np.float16, 'RtpStateBitfield':np.int8})

df.head()

Unnamed: 0,HasTpm,RtpStateBitfield,GeoNameIdentifier
0,True,0,0.038849
1,True,0,0.038849
2,True,0,0.012543
3,True,0,0.018234
4,True,0,0.006493


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10067 entries, 0 to 10066
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   HasTpm             10067 non-null  bool   
 1   RtpStateBitfield   10067 non-null  int8   
 2   GeoNameIdentifier  10067 non-null  float16
dtypes: bool(1), float16(1), int8(1)
memory usage: 39.4 KB


 <div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>2. Efficient File Formats</h2>
</div>
 
 __When to use__
 
 When a large file is stored in a legacy format such as .csv or Excel, it takes more space on disk and more time to read and write.
 
 You can do better using modern file formats to save data such as:
 1. feather
 2. parquet

In [None]:
# !pip install pyarrow

In [None]:
%%time
df = pd.read_csv('Datasets/large_dataset.csv')
df.head()

Wall time: 2.77 s


Unnamed: 0,HasTpm,Census_OSInstallLanguageIdentifier,LocaleEnglishNameIdentifier,EngineVersion,UacLuaenable,Census_MDC2FormFactor,Census_IsSecureBootEnabled,Census_OSVersion,Census_GenuineStateName,Census_InternalPrimaryDisplayResolutionHorizontal,RtpStateBitfield,Census_ActivationChannel,GeoNameIdentifier,Census_OSUILocaleIdentifier,Census_OSEdition,AVProductsInstalled,Census_FirmwareManufacturerIdentifier,Census_OEMNameIdentifier,Census_SystemVolumeTotalCapacity,Census_OSVersion_0
0,1,0.116,0.010284,0.4119,0,1,1,0.039,0,1366.0,0,2,0.01046,0.1167,2,2.0,0.08966,0.10034,475799.0,0
1,1,0.3562,0.2347,0.431,0,1,0,0.1584,0,1366.0,0,0,0.1718,0.3555,1,2.0,0.01384,0.10034,461478.0,0
2,1,0.3562,0.2347,0.02974,0,1,0,0.02313,2,1366.0,0,3,0.1718,0.3555,0,1.0,0.10547,0.1063,476438.0,0
3,1,0.3562,0.2347,0.4119,0,1,0,0.03644,0,1366.0,0,0,0.1718,0.3555,0,1.0,0.10547,0.1063,37974.0,0
4,1,0.116,0.03738,0.4119,0,0,0,0.001266,0,1280.0,0,0,0.03815,0.1167,1,1.0,0.0433,0.1038,475955.0,0


__Check file size__

In [None]:
from pathlib import Path
Path('Datasets/large_dataset.csv').stat().st_size / 1024**2

106.19118595123291

__To Feather format__

In [None]:
%%time
df.to_feather('Datasets/large_dataset.feather')

Wall time: 507 ms


__File size__

In [None]:
Path('Datasets/large_dataset.feather').stat().st_size / 1024**2

35.60710334777832

In [None]:
df.shape

(1000000, 20)

That's roughly a third of the CSV size, and the file loads back much faster.

__Read it back and record time__

In [None]:
%%time
df = pd.read_feather('Datasets/large_dataset.feather')
df.head()

Wall time: 113 ms


Unnamed: 0,HasTpm,Census_OSInstallLanguageIdentifier,LocaleEnglishNameIdentifier,EngineVersion,UacLuaenable,Census_MDC2FormFactor,Census_IsSecureBootEnabled,Census_OSVersion,Census_GenuineStateName,Census_InternalPrimaryDisplayResolutionHorizontal,RtpStateBitfield,Census_ActivationChannel,GeoNameIdentifier,Census_OSUILocaleIdentifier,Census_OSEdition,AVProductsInstalled,Census_FirmwareManufacturerIdentifier,Census_OEMNameIdentifier,Census_SystemVolumeTotalCapacity,Census_OSVersion_0
0,1,0.116,0.010284,0.4119,0,1,1,0.039,0,1366.0,0,2,0.01046,0.1167,2,2.0,0.08966,0.10034,475799.0,0
1,1,0.3562,0.2347,0.431,0,1,0,0.1584,0,1366.0,0,0,0.1718,0.3555,1,2.0,0.01384,0.10034,461478.0,0
2,1,0.3562,0.2347,0.02974,0,1,0,0.02313,2,1366.0,0,3,0.1718,0.3555,0,1.0,0.10547,0.1063,476438.0,0
3,1,0.3562,0.2347,0.4119,0,1,0,0.03644,0,1366.0,0,0,0.1718,0.3555,0,1.0,0.10547,0.1063,37974.0,0
4,1,0.116,0.03738,0.4119,0,0,0,0.001266,0,1280.0,0,0,0.03815,0.1167,1,1.0,0.0433,0.1038,475955.0,0


In [None]:
df.shape

(1000000, 20)

__Store as Parquet file__

In [None]:
%%time
df.to_parquet('Datasets/large_dataset.parquet')

Wall time: 997 ms


__File size__

In [None]:
Path('Datasets/large_dataset.parquet').stat().st_size / 1024**2

13.447064399719238

__Load it back and record time__

In [None]:
%%time
df = pd.read_parquet('Datasets/large_dataset.parquet')
df.head()

Wall time: 233 ms


Unnamed: 0,HasTpm,Census_OSInstallLanguageIdentifier,LocaleEnglishNameIdentifier,EngineVersion,UacLuaenable,Census_MDC2FormFactor,Census_IsSecureBootEnabled,Census_OSVersion,Census_GenuineStateName,Census_InternalPrimaryDisplayResolutionHorizontal,RtpStateBitfield,Census_ActivationChannel,GeoNameIdentifier,Census_OSUILocaleIdentifier,Census_OSEdition,AVProductsInstalled,Census_FirmwareManufacturerIdentifier,Census_OEMNameIdentifier,Census_SystemVolumeTotalCapacity,Census_OSVersion_0
0,1,0.116,0.010284,0.4119,0,1,1,0.039,0,1366.0,0,2,0.01046,0.1167,2,2.0,0.08966,0.10034,475799.0,0
1,1,0.3562,0.2347,0.431,0,1,0,0.1584,0,1366.0,0,0,0.1718,0.3555,1,2.0,0.01384,0.10034,461478.0,0
2,1,0.3562,0.2347,0.02974,0,1,0,0.02313,2,1366.0,0,3,0.1718,0.3555,0,1.0,0.10547,0.1063,476438.0,0
3,1,0.3562,0.2347,0.4119,0,1,0,0.03644,0,1366.0,0,0,0.1718,0.3555,0,1.0,0.10547,0.1063,37974.0,0
4,1,0.116,0.03738,0.4119,0,0,0,0.001266,0,1280.0,0,0,0.03815,0.1167,1,1.0,0.0433,0.1038,475955.0,0


__File Size on Disk__

1. CSV: 106 MB
2. feather: 36 MB
3. parquet: 13.4 MB

__Time to load__

1. CSV: 2.77 s
2. feather: 113 ms
3. parquet: 233 ms

__Parquet requires the least space but takes a bit longer to load than feather. Feather is the fastest to load.__
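To repeat this comparison on your own data, a small helper along these lines may be useful. It is only a sketch: `compare_formats` and the random `demo` dataframe are hypothetical, and a format is skipped when its engine (for example `pyarrow`) is not installed.

```python
import tempfile
import time
from pathlib import Path

import numpy as np
import pandas as pd

def compare_formats(df, folder):
    """Write df in each format; return file size (MB) and read time (s) per format."""
    formats = {
        "csv": (lambda p: df.to_csv(p, index=False), pd.read_csv),
        "feather": (df.to_feather, pd.read_feather),
        "parquet": (df.to_parquet, pd.read_parquet),
    }
    rows = []
    for ext, (write, read) in formats.items():
        path = Path(folder) / f"demo.{ext}"
        try:
            write(path)
        except ImportError:
            continue  # the engine for this format is not installed
        start = time.perf_counter()
        read(path)
        elapsed = time.perf_counter() - start
        rows.append({"format": ext,
                     "size_mb": path.stat().st_size / 1024**2,
                     "read_s": elapsed})
    return pd.DataFrame(rows)

# Hypothetical data: 10,000 rows of random floats.
demo = pd.DataFrame(np.random.rand(10_000, 5), columns=list("abcde"))
with tempfile.TemporaryDirectory() as tmp:
    report = compare_formats(demo, tmp)
print(report)
```

Timings from a single run are noisy, so treat the numbers as a rough guide rather than a benchmark.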

 <div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>3. Working with HDF5</h2>
</div>

HDF5 is a file format that stores data in a hierarchical structure and is fast to read from.

We first create an HDF5 store and then save the dataframes inside it in a hierarchy.

In [None]:
# !pip install tables==3.6.1

In [None]:
import numpy as np
from pandas import HDFStore, DataFrame

Create (or open) an HDF5 file in append mode.

In [None]:
hdf = HDFStore('Datasets/hdfstorage.h5', mode='a')
hdf

<class 'pandas.io.pytables.HDFStore'>
File path: Datasets/hdfstorage.h5

__Create data__

In [None]:
df = DataFrame(np.random.rand(5,3), columns=('A','B','C'))
df.head()

Unnamed: 0,A,B,C
0,0.516243,0.026702,0.428357
1,0.459609,0.497303,0.039483
2,0.17503,0.419541,0.231524
3,0.142982,0.971437,0.663617
4,0.961227,0.887601,0.415256


__Store data to table__

In [None]:
hdf.put('df', df, format='table', data_columns=('A','B','C'))

In [None]:
hdf['df']

Unnamed: 0,A,B,C
0,0.516243,0.026702,0.428357
1,0.459609,0.497303,0.039483
2,0.17503,0.419541,0.231524
3,0.142982,0.971437,0.663617
4,0.961227,0.887601,0.415256


__Append to existing dataset__

In [None]:
hdf.append('df', df, format='table', data_columns=('A','B','C'))

__Check contents__

In [None]:
hdf['df']

Unnamed: 0,A,B,C
0,0.516243,0.026702,0.428357
1,0.459609,0.497303,0.039483
2,0.17503,0.419541,0.231524
3,0.142982,0.971437,0.663617
4,0.961227,0.887601,0.415256
0,0.516243,0.026702,0.428357
1,0.459609,0.497303,0.039483
2,0.17503,0.419541,0.231524
3,0.142982,0.971437,0.663617
4,0.961227,0.887601,0.415256


__Close connection__

In [None]:
hdf.close()

__Read from Store again__

In [None]:
from pandas import read_hdf
df = read_hdf('Datasets/hdfstorage.h5', 'df')
df

Unnamed: 0,A,B,C
0,0.516243,0.026702,0.428357
1,0.459609,0.497303,0.039483
2,0.17503,0.419541,0.231524
3,0.142982,0.971437,0.663617
4,0.961227,0.887601,0.415256
0,0.516243,0.026702,0.428357
1,0.459609,0.497303,0.039483
2,0.17503,0.419541,0.231524
3,0.142982,0.971437,0.663617
4,0.961227,0.887601,0.415256


If you want to apply a rule when importing, such as filtering rows or importing only selected columns, it's very direct.

In [None]:
df = read_hdf('Datasets/hdfstorage.h5', 'df', where=['A>.5'], columns=['A','B'])
df

Unnamed: 0,A,B
0,0.516243,0.026702
4,0.961227,0.887601
0,0.516243,0.026702
4,0.961227,0.887601


__Store multiple tables under hierarchical directories inside HDFStore__

In [None]:
hdf = HDFStore('Datasets/hdfstorage.h5')

In [None]:
hdf.put('sales/t1', pd.DataFrame(np.random.rand(20,5)))
hdf.put('sales/t2', pd.DataFrame(np.random.rand(10,3)))
hdf.put('catalog/t1', pd.DataFrame(np.random.rand(15,2)))

__See the tables__

In [None]:
hdf.keys()

['/df', '/sales/t1', '/sales/t2', '/large/df', '/catalog/t1']

In [None]:
print(hdf.info())

<class 'pandas.io.pytables.HDFStore'>
File path: Datasets/hdfstorage.h5
/catalog/t1            frame        (shape->[15,2])                                                   
/df                    frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[A,B,C])
/large/df              frame        (shape->[1000000,20])                                             
/sales/t1              frame        (shape->[20,5])                                                   
/sales/t2              frame        (shape->[10,3])                                                   


__Put Large Data__

In [None]:
df = pd.read_csv('Datasets/large_dataset.csv')
hdf.put('large/df', df)

In [None]:
hdf.close()

__Read it back and time it__

In [None]:
%%time
hdf = HDFStore('Datasets/hdfstorage.h5')

Wall time: 10 ms


In [None]:
%%time
df = hdf['large/df']  # key (path) of the table inside the HDF store
df.head()

Wall time: 530 ms


Unnamed: 0,HasTpm,Census_OSInstallLanguageIdentifier,LocaleEnglishNameIdentifier,EngineVersion,UacLuaenable,Census_MDC2FormFactor,Census_IsSecureBootEnabled,Census_OSVersion,Census_GenuineStateName,Census_InternalPrimaryDisplayResolutionHorizontal,RtpStateBitfield,Census_ActivationChannel,GeoNameIdentifier,Census_OSUILocaleIdentifier,Census_OSEdition,AVProductsInstalled,Census_FirmwareManufacturerIdentifier,Census_OEMNameIdentifier,Census_SystemVolumeTotalCapacity,Census_OSVersion_0
0,1,0.116,0.010284,0.4119,0,1,1,0.039,0,1366.0,0,2,0.01046,0.1167,2,2.0,0.08966,0.10034,475799.0,0
1,1,0.3562,0.2347,0.431,0,1,0,0.1584,0,1366.0,0,0,0.1718,0.3555,1,2.0,0.01384,0.10034,461478.0,0
2,1,0.3562,0.2347,0.02974,0,1,0,0.02313,2,1366.0,0,3,0.1718,0.3555,0,1.0,0.10547,0.1063,476438.0,0
3,1,0.3562,0.2347,0.4119,0,1,0,0.03644,0,1366.0,0,0,0.1718,0.3555,0,1.0,0.10547,0.1063,37974.0,0
4,1,0.116,0.03738,0.4119,0,0,0,0.001266,0,1280.0,0,0,0.03815,0.1167,1,1.0,0.0433,0.1038,475955.0,0


In [None]:
hdf.close()

__Using read_hdf__

In [None]:
%%time
# read_hdf returns a DataFrame, not a store, so read-only mode is enough
df = read_hdf('Datasets/hdfstorage.h5', key='large/df', mode='r')
df.head()

 <div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>4. Reading a Big File in Chunks</h2>
</div>

__When to use__ 

Sometimes you don't want to read the full file into a dataframe. Instead, you want only specific rows and columns to go into your dataframe.

Example: how do you extract all the records with dose > 0.5? (It could be any logic, for example extracting the largest value in each chunk.)

Every record is still read, one chunk at a time, but only the filtered records are kept in the dataframe, which saves system memory. This is especially useful for large datasets.
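The "largest in each chunk" idea mentioned above could look like the sketch below. The data is generated in memory and only mimics the `len` and `dose` columns, so it is an assumption rather than the actual ToothGrowth file.

```python
import io

import pandas as pd

# Hypothetical data: 30 rows, read back in chunks of 10.
source = pd.DataFrame({
    "len": [(i * 7) % 31 for i in range(30)],
    "dose": [[0.5, 1.0, 2.0][i % 3] for i in range(30)],
})
csv_text = source.to_csv(index=False)

largest = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    # Keep only the row with the largest `len` in this chunk.
    largest.append(chunk.loc[[chunk["len"].idxmax()]])

result = pd.concat(largest)
print(result)
```

Only one row per chunk is retained, so memory use stays bounded by the chunk size plus the small list of results.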

In [None]:
import pandas as pd

In [None]:
df_chunker = pd.read_csv("Datasets/ToothGrowth.txt", chunksize=10)
print(df_chunker)

<pandas.io.parsers.TextFileReader object at 0x000002107A75FF88>


In [None]:
pd.read_csv("Datasets/ToothGrowth.txt").head()

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5


`df_chunker` can be iterated only once.

In [None]:
# Dose > .5 and Every 5th item
master = []

for i, df in enumerate(df_chunker):
    df = df.loc[(df.dose > .5) & (df.index % 5 == 0), :]
    if df.shape[0] >= 1:
        master.append(df)
    
master

[     len supp  dose
 10  16.5   VC     1
 15  17.3   VC     1,
      len supp  dose
 20  23.6   VC     2
 25  32.5   VC     2,
      len supp  dose
 40  19.7   OJ     1
 45  25.2   OJ     1,
      len supp  dose
 50  25.5   OJ     2
 55  30.9   OJ     2]

Combine all the dataframes in the list.

In [None]:
pd.concat(master)

Unnamed: 0,len,supp,dose
10,16.5,VC,1
15,17.3,VC,1
20,23.6,VC,2
25,32.5,VC,2
40,19.7,OJ,1
45,25.2,OJ,1
50,25.5,OJ,2
55,30.9,OJ,2


### Mini Challenge

Use chunking to extract the first 50 `EngineVersion` values from rows in the dataset that have at least 2 `AVProductsInstalled` and `Census_OSEdition` = 0.

```python
path = 'Datasets/large_dataset.csv'
```

In [None]:
import pandas as pd
import numpy as np

In [None]:
path = 'Datasets/large_dataset.csv'

In [None]:
df_chunker = pd.read_csv(path, chunksize=100)
print(df_chunker)

<pandas.io.parsers.TextFileReader object at 0x000002107A829608>


In [None]:
# Rows that have at least 2 AVProductsInstalled and Census_OSEdition == 0
master = []
engines_extracted = 0

for i, df in enumerate(df_chunker):
    df = df.loc[(df.AVProductsInstalled >= 2) & (df.Census_OSEdition == 0), "EngineVersion"]
    if (df.shape[0] >= 1):
        if engines_extracted <= 50:
            master.append(df)
            engines_extracted += df.shape[0]
        else:
            print("50 EngineVersions extracted")
            break
        
master

50 EngineVersions extracted


[31    0.411900
 47    0.411900
 49    0.431000
 52    0.015270
 60    0.411900
 62    0.431000
 83    0.005165
 99    0.411900
 Name: EngineVersion, dtype: float64,
 105    0.017970
 112    0.411900
 178    0.411900
 179    0.007935
 Name: EngineVersion, dtype: float64,
 227    0.005165
 230    0.431000
 243    0.431000
 256    0.001086
 264    0.431000
 268    0.004078
 269    0.431000
 286    0.411900
 Name: EngineVersion, dtype: float64,
 322    0.431000
 324    0.411900
 328    0.001714
 370    0.411900
 376    0.411900
 383    0.431000
 388    0.411900
 396    0.411900
 Name: EngineVersion, dtype: float64,
 428    0.431000
 437    0.005226
 455    0.431000
 461    0.431000
 470    0.431000
 474    0.015270
 487    0.431000
 492    0.431000
 Name: EngineVersion, dtype: float64,
 509    0.41190
 526    0.41190
 545    0.43100
 551    0.43100
 567    0.41190
 598    0.02391
 Name: EngineVersion, dtype: float64,
 604    0.431000
 610    0.431000
 635    0.004078
 652    0.431000
 661

In [None]:
df_out = pd.concat(master)
df_out

31     0.411900
47     0.411900
49     0.431000
52     0.015270
60     0.411900
62     0.431000
83     0.005165
99     0.411900
105    0.017970
112    0.411900
178    0.411900
179    0.007935
227    0.005165
230    0.431000
243    0.431000
256    0.001086
264    0.431000
268    0.004078
269    0.431000
286    0.411900
322    0.431000
324    0.411900
328    0.001714
370    0.411900
376    0.411900
383    0.431000
388    0.411900
396    0.411900
428    0.431000
437    0.005226
455    0.431000
461    0.431000
470    0.431000
474    0.015270
487    0.431000
492    0.431000
509    0.411900
526    0.411900
545    0.431000
551    0.431000
567    0.411900
598    0.023910
604    0.431000
610    0.431000
635    0.004078
652    0.431000
661    0.411900
678    0.431000
685    0.015270
696    0.411900
732    0.431000
774    0.015270
778    0.431000
791    0.411900
Name: EngineVersion, dtype: float64

In [None]:
df_out.shape

(54,)
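The loop above stops only after the running count has passed 50, which is why 54 values come back. One possible variant that returns exactly 50 is sketched here. It runs on randomly generated stand-in data (the column names match the dataset, the values do not) and trims each chunk's matches to the number still needed.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for large_dataset.csv with the three relevant columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "EngineVersion": rng.random(2000),
    "AVProductsInstalled": rng.integers(1, 4, size=2000),
    "Census_OSEdition": rng.integers(0, 3, size=2000),
})
csv_text = demo.to_csv(index=False)

target = 50
parts, collected = [], 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    mask = (chunk.AVProductsInstalled >= 2) & (chunk.Census_OSEdition == 0)
    hits = chunk.loc[mask, "EngineVersion"].iloc[: target - collected]  # take only what is still needed
    if not hits.empty:
        parts.append(hits)
        collected += len(hits)
    if collected >= target:
        break  # stop reading further chunks

first_50 = pd.concat(parts)
print(first_50.shape)
```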

<div class="alert alert-info" style="background-color:#006666; color:white; padding:0px 10px; border-radius:5px;"><h2 style='margin:10px 5px'>5. Load into a Database and Read from It</h2>
</div>

When you have a large file that cannot be loaded into memory, load it into a database in chunks and then query it using pandas. [link](https://www.youtube.com/watch?v=xKMyk4wDHnQ)

[SQLite Doc](https://docs.python.org/3/library/sqlite3.html)

In [None]:
import numpy as np
import pandas as pd
import sqlite3

__Steps__:

1. Create connection to database.
2. Load the csv to database in chunks
3. Use SQL Query to load data to pandas
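The three steps above can be sketched end to end with an in-memory SQLite database. The table name `demo`, the tiny dataframe, and the chunk size of 2 are assumptions chosen so the example runs without any files.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical data written to an in-memory CSV.
csv_text = pd.DataFrame({
    "Has Tpm": [1, 0, 1, 0, 1, 1],
    "Score": [10, 20, 30, 40, 50, 60],
}).to_csv(index=False)

# 1. Create a connection (":memory:" keeps the database in RAM).
con = sqlite3.connect(":memory:")
try:
    # 2. Load the CSV into the database chunk by chunk.
    for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
        chunk.columns = [c.replace(" ", "").lower() for c in chunk.columns]
        chunk.to_sql("demo", con, if_exists="append", index=False)

    # 3. Query only the rows you need back into pandas.
    result = pd.read_sql_query("SELECT * FROM demo WHERE hastpm = ?", con, params=(0,))
finally:
    con.close()

print(result)
```

Passing the filter value through `params` instead of formatting it into the SQL string keeps the query safe if the value ever comes from user input.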

In [None]:
filepath = 'Datasets/large_dataset.csv'
pd.read_csv(filepath, nrows=2)

Unnamed: 0,HasTpm,Census_OSInstallLanguageIdentifier,LocaleEnglishNameIdentifier,EngineVersion,UacLuaenable,Census_MDC2FormFactor,Census_IsSecureBootEnabled,Census_OSVersion,Census_GenuineStateName,Census_InternalPrimaryDisplayResolutionHorizontal,RtpStateBitfield,Census_ActivationChannel,GeoNameIdentifier,Census_OSUILocaleIdentifier,Census_OSEdition,AVProductsInstalled,Census_FirmwareManufacturerIdentifier,Census_OEMNameIdentifier,Census_SystemVolumeTotalCapacity,Census_OSVersion_0
0,1,0.116,0.010284,0.4119,0,1,1,0.039,0,1366.0,0,2,0.01046,0.1167,2,2.0,0.08966,0.10034,475799.0,0
1,1,0.3562,0.2347,0.431,0,1,0,0.1584,0,1366.0,0,0,0.1718,0.3555,1,2.0,0.01384,0.10034,461478.0,0


__Load the file in database without loading the full file in pandas__

In [None]:
# create connection
con = sqlite3.connect('files.db')

__Run SQL queries directly__

In [None]:
cur = con.cursor()
cur

<sqlite3.Cursor at 0x2107a83bdc0>

In [None]:
# Execute query. sqlite_master is a system table that contains all table names in sqlite db.
cur.execute("SELECT name FROM sqlite_master")

<sqlite3.Cursor at 0x2107a83bdc0>

In [None]:
# Show the query results
print(cur.fetchall())

[('large_dataset',), ('ix_large_dataset_index',)]


Drop the table if it is already present

In [None]:
# Execute query. IF EXISTS avoids an error when the table has not been created yet.
cur.execute("DROP TABLE IF EXISTS large_dataset;")

<sqlite3.Cursor at 0x2107a83bdc0>

In [None]:
# Show the query results
print(cur.fetchall())

[]


Commit changes

In [None]:
# commit the changes
con.commit()

Add data to the database table by reading the CSV in chunks with pandas and uploading each chunk.

In [None]:
# Add data to sqlite table in chunks
chunksize = 5000

for i, df in enumerate(pd.read_csv(filepath, chunksize=chunksize, iterator=True)):
    df = df.rename(columns = {c: c.replace(' ', '').lower() for c in df.columns})
    df['chunk'] = i
    
    df.to_sql('large_dataset', con, if_exists='append')
    if not i % 50: print(f"Chunk: {i} Loaded.")    

Chunk: 0 Loaded.
Chunk: 50 Loaded.
Chunk: 100 Loaded.
Chunk: 150 Loaded.


Read data from database

In [None]:
# Read data from table
df = pd.read_sql_query('SELECT * FROM large_dataset WHERE HASTPM=0', con)
df.head()

Unnamed: 0,index,hastpm,census_osinstalllanguageidentifier,localeenglishnameidentifier,engineversion,uacluaenable,census_mdc2formfactor,census_issecurebootenabled,census_osversion,census_genuinestatename,...,census_activationchannel,geonameidentifier,census_osuilocaleidentifier,census_osedition,avproductsinstalled,census_firmwaremanufactureridentifier,census_oemnameidentifier,census_systemvolumetotalcapacity,census_osversion_0,chunk
0,94,0,0.02843,0.02031,0.02974,0,0,0,0.003431,0,...,0,0.02072,0.02843,1,2.0,0.0433,0.1038,142625.0,0,0
1,229,0,0.3562,0.2347,0.431,0,1,0,0.002861,0,...,0,0.1718,0.3555,0,1.0,0.138,0.1444,144397.0,0,0
2,240,0,,0.2347,0.4119,0,1,0,0.00459,1,...,0,0.1718,0.3555,1,1.0,0.3025,0.002142,136795.0,0,0
3,281,0,,0.0505,0.431,0,1,0,0.000722,0,...,0,0.04752,0.0559,1,1.0,0.3025,0.002142,102400.0,0,0
4,366,0,0.02267,0.01817,0.431,0,0,0,0.004513,0,...,0,0.01823,0.02272,1,1.0,0.3025,0.007965,176389.0,0,0


In [None]:
df.shape

(12149, 22)

__Close connection__

In [None]:
con.close()