# Optimizing the Python Code for Big Data 
Balancing Coding Complexity against Computational Complexity 

    
    AUTHOR: Dr. Roy Jafari 

# Chapter 5: Picking up the right tool 

## Challenge 3: Restructuring and Reformulating Data – Second Case Study

In this challenge, we address a common issue in data preparation. Data is often stored in formats that either omit zero-value objects or pack lists or dictionaries into individual table cells to save disk space. Although this is efficient for storage, it can significantly complicate data preparation. This challenge lets us work through these difficulties and explore optimized tools for handling them. Follow the ten steps below to familiarize yourself with these difficulties and learn how to manage them effectively.

1. The code below uses `pd.read_csv()` to load the `stock_news.csv` file into the `news_df` DataFrame. Execute the code and review the dataset:

```python
import pandas as pd
news_df = pd.read_csv('stock_news.csv')
news_df
```

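2. The following code converts the `DateTime` column to the datetime type and prints the earliest and latest timestamps, confirming that the data belongs to the year 2023: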

```python
news_df.DateTime = pd.to_datetime(news_df['DateTime'])
print(news_df.DateTime.min())
print(news_df.DateTime.max())
```
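The same check can also be written as a single boolean test. The sketch below runs on a small made-up frame: `toy_df` and its timestamps are hypothetical and are not taken from `stock_news.csv`.

```python
import pandas as pd

# Hypothetical stand-in for news_df; the timestamps are invented for illustration
toy_df = pd.DataFrame({
    "DateTime": [
        "2023-01-03 09:15:00",
        "2023-06-30 16:45:00",
        "2023-12-29 11:05:00",
    ]
})
toy_df["DateTime"] = pd.to_datetime(toy_df["DateTime"])

# True only if every timestamp falls in the year 2023
all_in_2023 = bool(toy_df["DateTime"].dt.year.eq(2023).all())
print(all_in_2023)
```

Applied to the real data, the same expression on `news_df['DateTime']` would be expected to return `True` if the min/max printout above stays within 2023.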

3. Write a code snippet that extracts a unique list of all the stocks mentioned in the `Entities` column of your DataFrame. Challenge yourself to use the most efficient tools and techniques you can think of. Do not proceed to the next steps until you have completed this task. Once you are done, I will demonstrate the optimal tools and methods you could have used.

**Answer:**


4. The code below addresses the challenge described in Step 3, utilizing the following tools: the `.str` accessor, `.replace()`, `.split()`, and `.explode()` functions from pandas, along with Python's `set()` and `list()` functions. Execute the code and compare its runtime with the method you previously implemented.

```python
all_stocks = list(
    set(
        news_df.Entities
        .str[1:-1]
        .str.replace("'", "")
        .str.replace(" ", "")
        .str.split(',')
        .explode()
        .values.tolist()
    )
)
print(f"There are {len(all_stocks)} different stocks in news_df.")
all_stocks
```
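To see what each link of this chain does, it may help to run it on a few hand-made cells. The sketch below assumes that the `Entities` cells are strings that look like Python lists (for example `"['AAPL', 'MSFT']"`), which is what the slicing and replacing in the code above suggests. The tickers and the name `toy_entities` are hypothetical.

```python
import pandas as pd

# Hypothetical cells mimicking ticker lists stored as text (an assumption
# about the format of the 'Entities' column)
toy_entities = pd.Series(["['AAPL', 'MSFT']", "['AAPL']", "['TSLA', 'MSFT']"])

tickers = (
    toy_entities
    .str[1:-1]                          # drop the surrounding square brackets
    .str.replace("'", "", regex=False)  # drop the quote characters
    .str.replace(" ", "", regex=False)  # drop the spaces after the commas
    .str.split(",")                     # each cell becomes a Python list
    .explode()                          # one row per ticker, index repeated
)

unique_tickers = sorted(set(tickers))
print(tickers.tolist())
print(unique_tickers)
```

One thing to keep in mind with this approach: a cell that holds an empty list (`"[]"`) becomes an empty string after slicing and splitting, so it would show up as an empty ticker unless it is filtered out.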

**Answer:**


5. Each row of `news_df` represents one stock news item. We want to use this dataset to create `sentiment_df`, in which each row holds the aggregate sentiment scores of one stock for one hour. The following code creates a pandas DataFrame that is an empty version of `sentiment_df`. Run it and study the structure of the result.

```python
import datetime

# Ensure the DateTime column is in datetime format
news_df['DateTime'] = pd.to_datetime(news_df['DateTime'])

# Generate all hours for the year 2023
all_hours = [
    datetime.datetime(2023, 1, 1) + datetime.timedelta(hours=i) 
    for i in range(365 * 24)]

# Create a MultiIndex using the list of stocks and all hours
my_multi_index = pd.MultiIndex.from_product(
    (all_stocks, all_hours), 
    names=['Ticker', 'DateTime'])

# Create an empty DataFrame with the specified MultiIndex and columns for sentiment scores
sentiment_df = pd.DataFrame(
    index=my_multi_index, 
    columns=['Positive', 'Negative', 'Neutral','n_news'])

# Display the structure of the empty DataFrame
sentiment_df
```
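The hourly grid can also be produced by pandas itself. The sketch below is an alternative to the list comprehension, not the method used in this chapter: it builds the same hours with `pd.date_range()` on a deliberately small two-day window and combines them with two hypothetical tickers.

```python
import datetime
import pandas as pd

# Hypothetical tickers and a two-day window, kept small for easy inspection
toy_stocks = ["AAPL", "MSFT"]
start = datetime.datetime(2023, 1, 1)
n_hours = 2 * 24

# The list-comprehension approach used in the step above
hours_list = [start + datetime.timedelta(hours=i) for i in range(n_hours)]

# The same hours generated by pandas
hours_range = pd.date_range(
    start=start, periods=n_hours, freq=pd.Timedelta(hours=1))
same_hours = hours_range.equals(pd.DatetimeIndex(hours_list))

# Cartesian product: every ticker paired with every hour
toy_index = pd.MultiIndex.from_product(
    [toy_stocks, hours_range], names=["Ticker", "DateTime"])

print(same_hours)
print(len(toy_index))  # number of tickers times number of hours
```

The length of the resulting index is the number of tickers multiplied by the number of hours, which is why the full-year version grows quickly as more stocks are added.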

6. Your challenge in this step is to create the complete version of the `sentiment_df` DataFrame, filled with the aggregate sentiment scores and the number of news items for each stock and hour. Pay attention to whether you have picked the most efficient tools and techniques for the job. Once you are done, I will demonstrate the optimal tools and methods you could have used.

**Answer:**


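7. The following code completes the challenge described in the previous step by using the `.explode()` and `.groupby()` functions in an optimized way. Compare the runtime of your own code with this optimized version:

```python
%%time
# Parse the list-like strings in 'Entities' into real lists of tickers
news_df['Ticker'] = (news_df.Entities
        .str[1:-1]
        .str.replace("'", "")
        .str.replace(" ", "")
        .str.split(','))
# Assumption: 'DateTime_Hour' is the news timestamp truncated to the hour;
# it is (re)created here in case the CSV does not already provide it
news_df['DateTime_Hour'] = pd.to_datetime(news_df['DateTime']).dt.floor('h')
# One row per (news item, ticker) pair, grouped by ticker and hour
grouped = (news_df
           .explode('Ticker')
           .groupby(['Ticker', 'DateTime_Hour']))
# Mean sentiment scores and number of news items per group
sentiment_df = grouped[['Positive', 'Negative', 'Neutral']].mean()
sentiment_df['n_news'] = grouped.size()
sentiment_df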
```
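The explode-then-group pattern is easier to follow on a handful of rows. The sketch below uses invented scores and tickers (`toy_news` is hypothetical) and, as a variation on the code above, computes the means and the count in a single `.agg()` call with named aggregation.

```python
import pandas as pd

# Hypothetical news rows; scores and tickers are invented for illustration
toy_news = pd.DataFrame({
    "DateTime_Hour": pd.to_datetime(
        ["2023-01-02 09:00", "2023-01-02 09:00", "2023-01-02 10:00"]),
    "Ticker": [["AAPL", "MSFT"], ["AAPL"], ["MSFT"]],
    "Positive": [0.8, 0.4, 0.1],
    "Negative": [0.1, 0.2, 0.7],
    "Neutral": [0.1, 0.4, 0.2],
})

toy_sentiment = (
    toy_news
    .explode("Ticker")                     # one row per (news item, ticker)
    .groupby(["Ticker", "DateTime_Hour"])
    .agg(
        Positive=("Positive", "mean"),
        Negative=("Negative", "mean"),
        Neutral=("Neutral", "mean"),
        n_news=("Positive", "size"),       # number of news items in the group
    )
)
print(toy_sentiment)
```

Note that the first news item mentions two tickers, so after `.explode()` it contributes to both the AAPL and the MSFT groups for that hour.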

**Answer:** 

8. The `sentiment_df` we get from the previous step does not have the same structure as the one described in step 5. If there is no news for a ticker during a given hour, the row for that ticker and hour is omitted. Write a script that builds on the results of the previous steps so that `sentiment_df` contains a row for every ticker and every hour. Pay attention to whether you have picked the best tools to get this done. Once you are done, I will demonstrate the optimal tools and methods you could have used.

**Answer:**


9. The following code snippet leverages the `.update()` function to address the challenge described in the previous step in an optimized manner. Please study and execute the code, then compare its runtime with your previous implementation:

```python
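import datetime

# Make sure the timestamps are datetimes before building the hourly index
news_df['DateTime'] = pd.to_datetime(news_df['DateTime'])
# Every hour of 2023 (365 days x 24 hours)
all_hours = [
    datetime.datetime(2023, 1, 1) +
    datetime.timedelta(hours=i) for i in range(365 * 24)]
# Complete index: every ticker paired with every hour
my_multi_index = pd.MultiIndex.from_product(
    [all_stocks, all_hours],
    names=['Ticker', 'DateTime_Hour'])
# Float columns start as NaN, so hours without news keep NaN scores
stage_df = pd.DataFrame(
    index=my_multi_index,
    columns=['Positive', 'Negative', 'Neutral', 'n_news'],
    dtype=float)
# Hours without news have zero news items
stage_df['n_news'] = 0.0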

# Update the DataFrame with values from sentiment_df
stage_df.update(sentiment_df)

# Copy the updated DataFrame to sentiment_df for further use
sentiment_df = stage_df.copy()

# Output the updated DataFrame
sentiment_df
```
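A scaled-down version shows how `.update()` fills the complete scaffold from the sparse result. Everything in the sketch below is hypothetical (two tickers, three hours, invented scores); it only illustrates the alignment behavior.

```python
import pandas as pd

# Three consecutive hours on an invented day
hours = pd.date_range(
    "2023-01-02 09:00", periods=3, freq=pd.Timedelta(hours=1))

# Hypothetical sparse result: only hours with at least one news item exist
sparse = pd.DataFrame(
    {"Positive": [0.6, 0.1], "n_news": [2.0, 1.0]},
    index=pd.MultiIndex.from_tuples(
        [("AAPL", hours[0]), ("MSFT", hours[1])],
        names=["Ticker", "DateTime_Hour"]),
)

# Complete scaffold: every ticker for every hour
full_index = pd.MultiIndex.from_product(
    [["AAPL", "MSFT"], hours], names=["Ticker", "DateTime_Hour"])
stage = pd.DataFrame(
    index=full_index, columns=["Positive", "n_news"], dtype=float)
stage["n_news"] = 0.0

# update() aligns on the index and overwrites in place with non-NaN values
stage.update(sparse)
print(stage)
```

Rows that exist only in the scaffold keep `n_news` equal to 0 and a missing `Positive` score, while the two rows present in `sparse` are overwritten. Calling `sparse.reindex(full_index)` and then filling `n_news` with 0 is another way to reach a similar result, although it is not the approach taken in this chapter.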

**Answer:**


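10. For future reference, note the situations in this challenge where picking the right tool made the biggest difference: `.explode()` for unpacking list-like cells into rows, `pd.MultiIndex` for building the complete ticker-by-hour structure, and `.update()` for filling that structure from the aggregated results.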