### Using pandas efficiently

* As the dataframe size gets larger, efficiency becomes more important.
* SIDE NOTE: cuDF pandas is a mode within the cuDF library that accelerates pandas workflows on GPUs. It supports 100% of the pandas API by falling back to regular CPU pandas for operations it cannot run on the GPU, so existing pandas code can run with minimal to no modification. Local usage requires an NVIDIA GPU; see https://rapids.ai/cudf-pandas/ . We will not be covering this in the course.




* Example: user1_mouse.csv


In [None]:


import sys

import pandas as pd

full_data = pd.read_csv('user1_mouse.csv')

# sys.getsizeof reports the in-memory size of the whole dataframe object.
print("The size of object in bytes", sys.getsizeof(full_data))

The size of object in bytes 14257947


### If not all columns are needed, we can load less data right from the start.

In [None]:
# Loading only the columns we need reduces the memory footprint.
cols = ['User ID', 'Event Time', 'Relative X Position', 'Relative Y Position']
selected_cols = pd.read_csv('user1_mouse.csv', usecols=cols)
print("The size of object in bytes", sys.getsizeof(selected_cols))

The size of object in bytes 2554692


## Using efficient data types makes a huge difference in pandas

Example: the default data types that pandas infers automatically.

In [None]:
full_data.dtypes

Unnamed: 0,0
Event type,object
User ID,int64
Event Time,int64
Relative X Position,float64
Relative Y Position,float64
Absolute X Position,float64
Absolute Y Position,float64
Mouse Button,object
Button State,object


## Viewing the memory usage per column

In [None]:
full_data.memory_usage(deep=True)

Unnamed: 0,0
Index,132
Event type,4954232
User ID,638632
Event Time,638632
Relative X Position,638632
Relative Y Position,638632
Absolute X Position,638632
Absolute Y Position,638632
Mouse Button,2760126
Button State,2711633


## We can change the data type based on prior information.

* All values of categorical data are either in "categories" or np.nan. Internally, the data structure consists of a categories array and an integer array of codes, where each code points to the real value in the categories array.
* "int64" refers to a 64-bit signed integer, meaning it can store both positive and negative whole numbers, with a range from -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807. Most columns do not need this range.
* "float64" uses 1 sign bit, 11 exponent bits, and 52 explicitly stored significand bits, which is also more precision than most columns need.
* These can be downcast by specifying smaller types such as "int16", "int32", or "float32". "float16" is also available but not recommended because of its very limited precision.
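
As a rough illustration of downcasting, the sketch below builds a small synthetic dataframe (the values are made up and only borrow two column names from the mouse data) and shrinks its numeric columns. `pd.to_numeric` with `downcast="integer"` picks the smallest integer type that fits the values, while `astype("float32")` requests a specific type.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the mouse data (values are made up for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "User ID": np.ones(10_000, dtype="int64"),
    "Relative X Position": rng.random(10_000),  # float64 by default
})

before = df.memory_usage(deep=True).sum()

# Let pandas pick the smallest integer type that can hold every value.
df["User ID"] = pd.to_numeric(df["User ID"], downcast="integer")
# Explicitly request 32-bit floats (about 7 significant decimal digits).
df["Relative X Position"] = df["Relative X Position"].astype("float32")

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print(f"{before} bytes -> {after} bytes")
```

Check the actual value range of a column before downcasting; a type that is too small will overflow or lose precision.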

In [None]:

# Hands-on activity: experiment by changing the data types of some other columns to reduce the size of the dataframe further.
data_dtype_modified=pd.read_csv('user1_mouse.csv',dtype={"Event type":"category", 'Mouse Button':"category", "Button State":"category"})
data_dtype_modified.dtypes

Unnamed: 0,0
Event type,category
User ID,int64
Event Time,int64
Relative X Position,float64
Relative Y Position,float64
Absolute X Position,float64
Absolute Y Position,float64
Mouse Button,category
Button State,category


In [None]:
data_dtype_modified.memory_usage(deep=True) # reduction in bytes after changing the text columns to category

Unnamed: 0,0
Index,132
Event type,80062
User ID,638632
Event Time,638632
Relative X Position,638632
Relative Y Position,638632
Absolute X Position,638632
Absolute Y Position,638632
Mouse Button,80295
Button State,79958
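
The saving comes from the way a categorical column stores each distinct label once and keeps only small integer codes per row. A minimal sketch with made-up event labels (not taken from the CSV) shows both parts:

```python
import pandas as pd

# Made-up labels; the real columns also hold a few strings repeated many times.
events = pd.Series(["Move", "Click", "Move", "Move", "Scroll", "Click"] * 1000)
as_category = events.astype("category")

print(as_category.cat.categories)     # the distinct labels, stored once
print(as_category.cat.codes.head(6))  # per-row integer codes pointing at the labels
print(as_category.cat.codes.dtype)    # a small integer type is enough for few labels

print(events.memory_usage(deep=True), "bytes as plain strings")
print(as_category.memory_usage(deep=True), "bytes as category")
```

The benefit shrinks as the number of distinct values approaches the number of rows, so columns such as unique IDs or free text are poor candidates.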


## Calculating the reduction in overall dataframe memory

In [None]:
print("The size of object in bytes",sys.getsizeof(data_dtype_modified))
reduction = data_dtype_modified.memory_usage(deep=True).sum() / full_data.memory_usage(deep=True).sum()
print(f"Reduction : {reduction:0.2f}") # new size as a fraction of the original: memory dropped to 29 percent

The size of object in bytes 4072271
Reduction : 0.29


## Using chunking
* Chunking works well when the operation you’re performing requires zero or minimal coordination between chunks.
* In this example, the CSV file is read in chunks of 1000 rows. Each chunk can then be processed within the loop, allowing for operations such as data cleaning, transformation, or analysis. After processing, chunks can be concatenated or aggregated as needed. Chunking is particularly useful when dealing with datasets that exceed available memory, ensuring efficient and scalable data processing.

In [None]:
import pandas as pd

chunksize = 1000
for chunk in pd.read_csv('user1_mouse.csv', chunksize=chunksize):
    # tasks go here
    print(chunk.shape)

(1000, 9)
(1000, 9)
(1000, 9)
... (76 more identical lines omitted)
(829, 9)
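
To show what "minimal coordination between chunks" can look like, here is a small sketch that computes an overall mean and per-group sums without ever holding the full table in memory. The CSV text, column names, and chunk size are hypothetical and chosen only to keep the example self-contained.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a large file (hypothetical data).
csv_text = "user,value\n" + "\n".join(f"{i % 3},{i}" for i in range(10))

total = 0
row_count = 0
partial_sums = []

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Keep only small running aggregates, never the chunks themselves.
    total += chunk["value"].sum()
    row_count += len(chunk)
    partial_sums.append(chunk.groupby("user")["value"].sum())

# Combine the per-chunk group sums into one result per user.
per_user = pd.concat(partial_sums).groupby(level=0).sum()

print("mean value:", total / row_count)
print(per_user)
```

Sums and counts combine cleanly across chunks; statistics such as the median need more care because they cannot be rebuilt from per-chunk results alone.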


### Check out and experiment with Parquet and Dask in Python; both are very handy for large data files.

## Parallel processing libraries in Python
* The built-in multiprocessing module has long been the standard tool for this, but the pandarallel library makes it much easier to parallelize pandas operations.

| Without parallelization | With parallelization |
|---|---|
| df.apply(func) | df.parallel_apply(func) |
| df.applymap(func) | df.parallel_applymap(func) |
| df.groupby(args).apply(func) | df.groupby(args).parallel_apply(func) |
| series.map(func) | series.parallel_map(func) |
| series.apply(func) | series.parallel_apply(func) |

pandarallel supports more functions of this kind.

In [None]:
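def func(column):
    # Deliberately slow, CPU-bound work; DataFrame.apply calls this once per column.
    a = 0
    for i in range(100000000):
        a = i / 2
    return a


# Without parallel execution
full_data.apply(func)
# With parallel execution (install the package first: pip install pandarallel)
from pandarallel import pandarallel
#pandarallel.initialize(progress_bar=True) # alternatively initialize(nb_workers=os.cpu_count()-1), which needs import os
#full_data.parallel_apply(func)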


ModuleNotFoundError: No module named 'pandarallel'

# Introduction to working with databases

* Understanding databases is crucial because they allow us to:
    - Store information in a structured and secure way.
    - Retrieve exactly what we need quickly, even when we’re dealing with thousands or millions of records.
    - Build dynamic, data-driven applications.
    - For example, when you log into an app, your profile information is fetched from a database, and when you search for a product online, the results come from a database query.

## Overview of Relational Databases
* A relational database organizes data into tables—much like a spreadsheet—where:
    - Rows represent individual records (for example, a single student’s information).
    - Columns represent attributes of those records (like student ID, name, and grade).
* Primary Key: A unique identifier for each record. For instance, in a ‘Students’ table, a student ID is used so no two records are the same.
* Foreign Key: This is used to link tables together. For example, if you have a ‘Students’ table and a ‘Courses’ table, you can use a foreign key in the ‘Courses’ table to refer to a student in the ‘Students’ table.
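
A minimal sketch of the two key types, using Python's built-in `sqlite3` module. The table layout follows the Students and Courses example above; the column names and sample rows are assumptions made for illustration. Note that SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON` has been run on the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE courses (
        course_id INTEGER PRIMARY KEY,
        title TEXT,
        student_id INTEGER REFERENCES students(student_id)
    )
""")

conn.execute("INSERT INTO students (student_id, name) VALUES (1, 'Alice')")
conn.execute("INSERT INTO courses (title, student_id) VALUES ('Databases', 1)")

# Student 99 does not exist, so the foreign key rejects this row.
try:
    conn.execute("INSERT INTO courses (title, student_id) VALUES ('Statistics', 99)")
    rejected = False
except sqlite3.IntegrityError as err:
    rejected = True
    print("Rejected:", err)

# The foreign key is also what lets us join the two tables.
joined = conn.execute("""
    SELECT students.name, courses.title
    FROM courses JOIN students ON courses.student_id = students.student_id
""").fetchall()
print(joined)
conn.close()
```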

## Some basic SQL to jog your memory
* SQL, which stands for Structured Query Language, is the standard language used to interact with relational databases.
* With SQL, we can:
    - Create tables using commands like CREATE TABLE.
    - Insert data with INSERT INTO.
    - Retrieve data using the SELECT statement.
    - Update or delete data with UPDATE and DELETE commands.

* For instance, if we want to see all students with a grade above 90, we’d write a query like:
SELECT * FROM Students WHERE grade > 90;
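
The commands listed above can be tried end to end with Python's built-in `sqlite3` module. The sketch below uses an in-memory database and made-up student rows; only the `Students` table name and the `grade > 90` filter come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE and INSERT INTO
cur.execute("CREATE TABLE Students (id INTEGER PRIMARY KEY, name TEXT, grade REAL)")
cur.execute(
    "INSERT INTO Students (name, grade) VALUES ('Alice', 95), ('Bob', 85), ('Cara', 91)"
)

# SELECT with a WHERE filter
cur.execute("SELECT name FROM Students WHERE grade > 90 ORDER BY name")
top_students = cur.fetchall()
print(top_students)

# UPDATE and DELETE
cur.execute("UPDATE Students SET grade = 92 WHERE name = 'Bob'")
cur.execute("DELETE FROM Students WHERE name = 'Cara'")

cur.execute("SELECT name, grade FROM Students ORDER BY id")
remaining = cur.fetchall()
print(remaining)
conn.close()
```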

Use the following link to brush up on the basics: https://www.geeksforgeeks.org/basic-database-concepts/



## SQLite

* Python comes with built-in support for a very lightweight database called SQLite.
* It’s a file-based database engine. This means that the entire database is stored in a single file on your computer, making it very easy to set up and use—no separate server installation is needed.
* SQLite is perfect for learning because it’s simple and doesn’t require any complex configuration. It’s great for small projects, prototyping, or even for mobile applications where a full-scale server isn’t needed.
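
To make the "single file" point concrete, the following sketch writes a database into a temporary directory, closes it, and reads it back through a new connection. The file name `example.db` and the `notes` table are hypothetical.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    db_path = os.path.join(tmp_dir, "example.db")  # hypothetical file name

    # Connecting creates the file if it does not exist yet.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE notes (body TEXT)")
    conn.execute("INSERT INTO notes VALUES ('hello')")
    conn.commit()  # write the changes to the file
    conn.close()

    file_exists = os.path.exists(db_path)

    # A fresh connection sees the data because it lives in the file.
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT body FROM notes").fetchall()
    conn.close()

print(file_exists, rows)
```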


### We will go over a basic example with the various steps involved:
Connecting to the Database:

In [2]:
import sqlite3
# Passing ':memory:' creates a temporary in-memory database that lives in RAM
# and lasts only as long as the program runs. Pass a filename instead to use
# an on-disk database; the file is created if it does not already exist.
conn = sqlite3.connect(':memory:')
# The conn variable is our connection to the database.

Creating a Table:

In [3]:
# The connection (conn) represents the active link to the database; a cursor
# executes SQL statements and retrieves their results.
cursor = conn.cursor()
cursor.execute('''
    CREATE TABLE students (
        id INTEGER PRIMARY KEY,
        name TEXT,
        grade REAL
    )
''')
# id is the primary key; SQLite assigns it automatically when a row is inserted without one.

<sqlite3.Cursor at 0x7f4b4b2cf940>

Inserting Data:

In [4]:
cursor.execute("INSERT INTO students (name, grade) VALUES ('Alice', 90)")
cursor.execute("INSERT INTO students (name, grade) VALUES ('Bob', 85)")
conn.commit() # tells the database to save the changes.
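
When the values come from variables rather than being typed into the SQL string, `?` placeholders are the safer choice: the `sqlite3` module handles quoting and this avoids SQL injection. A short standalone sketch with made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade REAL)")

# Made-up rows; the apostrophe in O'Brien needs no manual escaping.
new_students = [("Carol", 78), ("Dave", 92.5), ("O'Brien", 88)]
cursor.executemany("INSERT INTO students (name, grade) VALUES (?, ?)", new_students)
conn.commit()

# Placeholders work in queries too; parameters are passed as a tuple.
cursor.execute("SELECT name, grade FROM students WHERE grade >= ? ORDER BY id", (85,))
rows = cursor.fetchall()
print(rows)
conn.close()
```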

Querying the Data:

In [5]:
cursor.execute("SELECT * FROM students")
rows = cursor.fetchall()
print("Students:", rows)
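# Optional: copy the in-memory database to a file (only needed for ':memory:' databases).
# VACUUM INTO needs SQLite 3.27 or newer and fails if the target file already exists and is not empty.
conn.execute("VACUUM main INTO 'database1.db'")

conn.close() # closes the connection; it does not commit, so call conn.commit() first to keep pending changes.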

Students: [(1, 'Alice', 90.0), (2, 'Bob', 85.0)]
