<a href="https://colab.research.google.com/github/stefisha/StefanVelickovic_Omega_DS_InvestmentRounds/blob/main/VegaIT_Task_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [24]:
%pip install memory_profiler

Collecting memory_profiler
  Downloading memory_profiler-0.61.0-py3-none-any.whl.metadata (20 kB)
Downloading memory_profiler-0.61.0-py3-none-any.whl (31 kB)
Installing collected packages: memory_profiler
Successfully installed memory_profiler-0.61.0


In [25]:
from google.colab import drive
import pandas as pd
from memory_profiler import memory_usage

In [4]:
# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# Define the file path to the CSV file on Google Drive
file_path = '/content/drive/MyDrive/Data Science Task - VegaIT/python_task_data.csv'  # Update this path if needed

In [6]:
# Read the CSV file
df = pd.read_csv(file_path)

In [7]:
# Display the first few rows of the data
df.head()

| Unnamed: 0 | permalink | company | numEmps | category | city | state | fundedDate | raisedAmt | raisedCurrency | round |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | lifelock | LifeLock | | web | Tempe | AZ | 1-May-07 | 6850000 | USD | b |
| 1 | lifelock | LifeLock | | web | Tempe | AZ | 1-Oct-06 | 6000000 | USD | a |
| 2 | lifelock | LifeLock | | web | Tempe | AZ | 1-Jan-08 | 25000000 | USD | c |
| 3 | mycityfaces | MyCityFaces | 7.0 | web | Scottsdale | AZ | 1-Jan-08 | 50000 | USD | seed |
| 4 | flypaper | Flypaper | | web | Phoenix | AZ | 1-Feb-08 | 3000000 | USD | a |


## 1. Processing Data in Chunks

Instead of loading the entire dataset into memory, which can be problematic for large datasets, we process the data in smaller chunks. This way, only a subset of the data is loaded at any time, significantly reducing memory consumption.

In [12]:
# Define chunk size for processing large datasets in small portions
chunk_size = 10000  # Adjust based on your system's memory

**Why this is efficient**:
By using chunks, only the data that is currently being processed resides in memory, and once a chunk is processed, it is discarded, keeping memory usage low.
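As a small illustration of the chunking behaviour described above, the sketch below reads a tiny in-memory CSV in chunks of four rows. The data, the `csv_text` variable, and the chunk size of 4 are made up for demonstration; only the `chunksize` parameter of `pd.read_csv` comes from the notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file: 10 rows held in a string.
csv_text = "raisedAmt,round\n" + "\n".join(
    f"{i * 1000},{'a' if i % 2 == 0 else 'b'}" for i in range(1, 11)
)

chunk_lengths = []
running_total = 0.0

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of a single large DataFrame.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_lengths.append(len(chunk))           # rows held in memory right now
    running_total += chunk["raisedAmt"].sum()  # keep only the aggregate

print(chunk_lengths)
print(running_total)
```

Only the running aggregate survives each iteration; the `chunk` variable is rebound on the next pass, so the previous DataFrame becomes eligible for garbage collection.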

## 2. Optimizing Data Types with `dtype`
Pandas infers data types automatically when loading data, and the inferred types are not always memory efficient. Passing a column-to-type mapping through the `dtype` parameter of `read_csv` lets you choose the types yourself and minimize the memory footprint, especially for categorical data.

In [13]:
# Define memory-efficient data types for each column
dtype_dict = {
    'permalink': 'category',  # Use 'category' for strings with repeated values to save memory
    'company': 'category',
    'category': 'category',
    'city': 'category',
    'state': 'category',
    'raisedAmt': 'float64',   # Keep raisedAmt as float for financial calculations
    'round': 'category'       # Use 'category' to save space for the 'round' column
}

**Why this is efficient**:
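Categorical data types reduce memory usage by storing each distinct string once and representing the column as integer codes that point to those values. For large datasets with repeated string values (e.g., `city`, `category`), this can significantly reduce memory consumption. The saving shrinks as the number of distinct values approaches the number of rows, so columns of mostly unique strings benefit little.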
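To make the effect of the `category` dtype concrete, here is a rough comparison on synthetic data. The series contents and variable names are illustrative assumptions, and the exact byte counts will vary by pandas version and platform, so none are quoted here.

```python
import pandas as pd

# Hypothetical low-cardinality column: 10,000 rows, three distinct city names.
repeated = pd.Series(
    ["Tempe", "Scottsdale", "Phoenix", "Tempe"] * 2500, dtype="object"
)
object_bytes = repeated.memory_usage(deep=True)
category_bytes = repeated.astype("category").memory_usage(deep=True)

# Hypothetical high-cardinality column: every value is unique.
unique = pd.Series([f"company-{i}" for i in range(10000)], dtype="object")
unique_object_bytes = unique.memory_usage(deep=True)
unique_category_bytes = unique.astype("category").memory_usage(deep=True)

# deep=True includes the size of the Python string objects themselves.
print(f"repeated values: category uses {category_bytes / object_bytes:.1%} of object")
print(f"unique values:   category uses {unique_category_bytes / unique_object_bytes:.1%} of object")
```

Running a check like this on a real column before adding it to `dtype_dict` is a quick way to see whether `category` actually helps for that column.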

## 3. Loading Only Necessary Columns
Instead of loading the entire dataset, you can load only the specific columns that are necessary for your calculation or analysis. This avoids loading irrelevant data, which can save memory.

In [14]:
# List of columns to load (optimize by loading only the required columns)
columns_to_load = ['raisedAmt', 'round']

**Why this is efficient**:
By loading only the columns you need, you reduce the amount of data in memory. For example, if you only care about `raisedAmt` and `round`, there's no need to load columns like `company` or `city`.
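The sketch below shows what `usecols` does on a few rows in the style of the dataset preview. The in-memory `csv_text` string is an assumption made for the sake of a self-contained example; in the notebook the source is the file on Google Drive.

```python
import io

import pandas as pd

# A few sample rows kept in memory so the example needs no file.
csv_text = (
    "company,city,raisedAmt,round\n"
    "LifeLock,Tempe,6850000,b\n"
    "LifeLock,Tempe,6000000,a\n"
    "MyCityFaces,Scottsdale,50000,seed\n"
)

# Without usecols every column is parsed and stored.
full = pd.read_csv(io.StringIO(csv_text))

# With usecols only the listed columns are kept in the resulting DataFrame.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["raisedAmt", "round"])

print(list(full.columns))
print(list(subset.columns))
```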

In [15]:
# Generator function to process the data chunk by chunk with minimal memory usage
def series_a_funding_generator(file_path, chunk_size=10000):
    for chunk in pd.read_csv(file_path, chunksize=chunk_size, dtype=dtype_dict, usecols=columns_to_load):
        # Filter the chunk for Series A funding rounds
        yield chunk[chunk['round'] == 'a']

In [19]:
# Initialize the total amount and count of Series A funding rounds
total_series_a = 0
count_series_a = 0

## 4. Using Generators to Stream Data
A generator yields one piece of data at a time and doesn't hold all data in memory at once. This method ensures that intermediate results are not kept in memory after they've been processed.

In [20]:
# Process the file in chunks, summing the Series A funding amounts and counting rows
for chunk in series_a_funding_generator(file_path, chunk_size=chunk_size):
    total_series_a += chunk['raisedAmt'].sum()  # raisedAmt is already float64 via dtype_dict
    count_series_a += chunk.shape[0]  # Count the number of Series A rows in the chunk

**Why this is efficient**:
A generator processes each chunk one at a time without storing all intermediate results in memory. This approach ensures that only the current chunk is loaded and processed, keeping memory usage to a minimum.
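One detail of the streaming approach is worth spelling out: the overall average has to come from an accumulated sum and count, because averaging the per-chunk means would weight chunks of different sizes equally. The minimal sketch below uses the five preview rows and a hypothetical helper, `filtered_chunks`, that is not part of the notebook.

```python
import io

import pandas as pd

# The five preview rows, reduced to the two columns that matter here.
csv_text = (
    "raisedAmt,round\n"
    "6850000,b\n"
    "6000000,a\n"
    "25000000,c\n"
    "50000,seed\n"
    "3000000,a\n"
)


def filtered_chunks(source, round_name, chunk_size):
    """Yield, chunk by chunk, only the rows whose round equals round_name."""
    for chunk in pd.read_csv(source, chunksize=chunk_size):
        yield chunk[chunk["round"] == round_name]


total = 0.0
count = 0
for rows in filtered_chunks(io.StringIO(csv_text), "a", chunk_size=2):
    total += rows["raisedAmt"].sum()  # an empty selection contributes 0
    count += len(rows)

# Divide once at the end; guard against the case where no rows matched.
average = total / count if count else 0.0
print(total, count, average)
```

With a chunk size of 2, the middle chunk contains no Series A rows at all, which is why the sum and count are accumulated separately rather than averaged per chunk.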

In [21]:
# Calculate the average Series A funding amount
average_series_a = total_series_a / count_series_a if count_series_a != 0 else 0

In [22]:
# Output the total and average Series A funding amounts
print(f"Total Series A Funding: {total_series_a}")
print(f"Average Series A Funding: {average_series_a}")

Total Series A Funding: 4380015000.0
Average Series A Funding: 7525798.969072165
