---

# Data Re-Indexing

Before we analyze this data, we need to understand one crucial property of it: some days have more than one event, and other days have no events at all.
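
As a quick illustration of both cases, here is a minimal sketch on made-up dates (the `toy_dates` series is hypothetical, not taken from the real dataset). It counts events per day, then compares the observed days against a continuous daily calendar to find the gaps.

```python
import pandas as pd

# Hypothetical event dates: two events on one day, then a two-day gap
toy_dates = pd.Series(pd.to_datetime(["2024-08-20", "2024-08-20", "2024-08-23"]))

# How many events fall on each observed day
events_per_day = toy_dates.value_counts().sort_index()

# Days with more than one event
multi_event_days = events_per_day[events_per_day > 1]

# Calendar days inside the observed range that have no event at all
full_range = pd.date_range(toy_dates.min(), toy_dates.max(), freq="D")
missing_days = full_range.difference(events_per_day.index)

print(multi_event_days)
print(missing_days)
```

Running the same two checks on the real `Date` column would show how uneven the event calendar is before any reindexing.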

## Data Scoring: AvgTone

Let's try to understand how bad an event was. The "AvgTone" column might be insightful.

Here's the documentation:
> (numeric) This is the average “tone” of all documents containing one or more
> mentions of this event. The score ranges from -100 (extremely negative) to +100 (extremely
> positive). Common values range between -10 and +10, with 0 indicating neutral. This can be
> used as a method of filtering the “context” of events as a subtle measure of the importance of
> an event and as a proxy for the “impact” of that event. For example, a riot event with a slightly
> negative average tone is likely to have been a minor occurrence, whereas if it had an extremely
> negative average tone, it suggests a far more serious occurrence. A riot with a positive score
> likely suggests a very minor occurrence described in the context of a more positive narrative
> (such as a report of an attack occurring in a discussion of improving conditions on the ground in
> a country and how the number of attacks per day has been greatly reduced).
[![GDELT Data Format Codebook](https://img.shields.io/badge/GDELT%20Data%20Format%20Codebook-Download-blue)](http://data.gdeltproject.org/documentation/GDELT-Data_Format_Codebook.pdf)

Here it's clear that an event with a very negative value would be considered more **impactful**. I'm going to assume this is a proxy for more dangerous. The riot example in the documentation supports this assumption. Thus, the lower the "AvgTone", the worse the event was.


When we later analyze AvgTone, we'll want to do so on a daily basis. If a day has more than one event, we'll take the sum of the AvgTone of all events on that day to score that day. We'll call this Total AvgTone.
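
To make the daily score concrete, here is a minimal sketch on made-up rows (the `toy` frame and its values are hypothetical). It groups by day and produces the event count plus the summed AvgTone, using the same column names this notebook uses later.

```python
import pandas as pd

# Hypothetical rows: two events on 2024-08-20 and one on 2024-08-22
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2024-08-20", "2024-08-20", "2024-08-22"]),
    "EventCode": [145, 145, 145],
    "AvgTone": [-4.0, -2.5, 0.0],
})

# One row per day: how many events occurred and the sum of their AvgTone
daily = (
    toy.groupby("Date")
       .agg(**{
           "Number of Events": ("EventCode", "size"),
           "Total AvgTone": ("AvgTone", "sum"),
       })
       .reset_index()
)

# 2024-08-20 ends up with 2 events and a Total AvgTone of -6.5
print(daily)
```

Because the score is a sum, a day with many mildly negative events can score lower than a day with one strongly negative event. Under the "lower is worse" assumption above, that is the intended behavior.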

---

---

Let's pull in the data.

---

In [48]:
import pandas as pd

# Load the data
data = pd.read_csv('csv/data_cleaned.csv')

# Display the first few rows of the dataframe
data.head()

Unnamed: 0,Date,EventCode,ActionGeo_FullName,ActionGeo_Lat,ActionGeo_Long,AvgTone
0,2024-08-23,145,"Union Park, Illinois, United States",41.8839,-87.6648,-3.046968
1,2024-08-22,145,"Union Park, Illinois, United States",41.8839,-87.6648,0.0
2,2024-08-20,145,"Union Park, Illinois, United States",41.8839,-87.6648,-4.319654
3,2024-08-20,145,"Union Park, Illinois, United States",41.8839,-87.6648,-4.319654
4,2024-06-27,145,"Buckingham Fountain, Illinois, United States",41.8756,-87.6189,-7.052186


In [49]:
# Load the target data
target_data = pd.read_csv('csv/target_data.csv')

# Display the first few rows of the target dataframe
target_data.head()

Unnamed: 0,Date,EventCode,ActionGeo_FullName,ActionGeo_Lat,ActionGeo_Long,AvgTone
0,2024-08-23,145,"Union Park, Illinois, United States",41.8839,-87.6648,-3.046968
1,2024-08-22,145,"Union Park, Illinois, United States",41.8839,-87.6648,0.0
2,2024-08-20,145,"Union Park, Illinois, United States",41.8839,-87.6648,-4.319654
3,2024-08-20,145,"Union Park, Illinois, United States",41.8839,-87.6648,-4.319654
4,2024-06-27,145,"Buckingham Fountain, Illinois, United States",41.8756,-87.6189,-7.052186


---

## Now let's reindex the data

---

---

All data

---

In [50]:
# Convert Date to datetime and aggregate data by date
data['Date'] = pd.to_datetime(data['Date'], format='%Y-%m-%d')
aggregated = data.groupby('Date').agg({'EventCode': 'size', 'AvgTone': 'sum'}).reset_index()
aggregated.columns = ['Date', 'Number of Events', 'Total AvgTone']

# Create a date range and reindex the dataframe
all_dates = pd.date_range(start=aggregated['Date'].min(), end=pd.Timestamp.today())
reindexed_data = aggregated.set_index('Date').reindex(all_dates, fill_value=0).reset_index()

# Rename columns
reindexed_data.columns = ['Date', 'Number of Events', 'Total AvgTone']

# Display the resulting DataFrame
reindexed_data


Unnamed: 0,Date,Number of Events,Total AvgTone
0,2015-01-08,1,2.007772
1,2015-01-09,0,0.000000
2,2015-01-10,0,0.000000
3,2015-01-11,0,0.000000
4,2015-01-12,0,0.000000
...,...,...,...
3636,2024-12-22,0,0.000000
3637,2024-12-23,0,0.000000
3638,2024-12-24,0,0.000000
3639,2024-12-25,0,0.000000
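
A sanity check along these lines may be worth running after reindexing. This is a minimal sketch on a made-up aggregate (the `agg` frame is hypothetical). It confirms that reindexing yields exactly one row per calendar day and that filling with 0 does not change the total number of events.

```python
import pandas as pd

# Hypothetical daily aggregate with a two-day gap between observed days
agg = pd.DataFrame(
    {"Number of Events": [2, 1], "Total AvgTone": [-6.5, 0.0]},
    index=pd.to_datetime(["2024-08-20", "2024-08-23"]),
)

# Continuous daily calendar spanning the observed range
all_days = pd.date_range(agg.index.min(), agg.index.max(), freq="D")

# Days with no events get 0 in both columns
filled = agg.reindex(all_days, fill_value=0)

# One row per calendar day, and no events gained or lost
assert len(filled) == len(all_days)
assert filled["Number of Events"].sum() == agg["Number of Events"].sum()

print(filled)
```

Note that a Total AvgTone of 0 can mean either "no events that day" or "events whose tone summed to 0", as on 2024-08-23 in this toy data. The Number of Events column is what tells the two cases apart.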


---

Only target data

---

In [51]:
# Rename 'Date' column to 'SQLDATE'
target_data.rename(columns={'Date': 'SQLDATE'}, inplace=True)

# Convert SQLDATE to datetime
target_data['SQLDATE'] = pd.to_datetime(target_data['SQLDATE'], format='%Y-%m-%d')

# Aggregate target_data by date
aggregated = target_data.groupby('SQLDATE').agg({'EventCode': 'size', 'AvgTone': 'sum'}).reset_index()
aggregated.columns = ['SQLDATE', 'Number of Events', 'Total AvgTone']

# Create a date range and reindex the dataframe
all_dates = pd.date_range(start=aggregated['SQLDATE'].min(), end=pd.Timestamp.today())
reindexed_target_data = aggregated.set_index('SQLDATE').reindex(all_dates, fill_value=0).reset_index()

# Rename columns
reindexed_target_data.columns = ['SQLDATE', 'Number of Events', 'Total AvgTone']

# Display the resulting DataFrame
reindexed_target_data


Unnamed: 0,SQLDATE,Number of Events,Total AvgTone
0,2015-12-25,1,-12.850954
1,2015-12-26,0,0.000000
2,2015-12-27,0,0.000000
3,2015-12-28,0,0.000000
4,2015-12-29,0,0.000000
...,...,...,...
3285,2024-12-22,0,0.000000
3286,2024-12-23,0,0.000000
3287,2024-12-24,0,0.000000
3288,2024-12-25,0,0.000000


---

## Save the reindexed data

---

In [52]:
# Save the reindexed data
reindexed_data.to_csv('csv/data_reindexed.csv', index=False)

# Save the reindexed target data
reindexed_target_data.to_csv('csv/target_data_reindexed.csv', index=False)
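
When these files are loaded again later, the date column comes back as plain strings unless it is parsed explicitly. Here is a minimal round-trip sketch that writes a made-up frame to a temporary directory (the `frame` contents and the temporary path are hypothetical) and reads it back with `parse_dates`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical reindexed frame with the same column names as above
frame = pd.DataFrame({
    "Date": pd.date_range("2024-08-20", periods=3, freq="D"),
    "Number of Events": [2, 0, 1],
    "Total AvgTone": [-6.5, 0.0, 0.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data_reindexed.csv")
    frame.to_csv(path, index=False)

    # parse_dates restores the datetime dtype that CSV cannot store
    loaded = pd.read_csv(path, parse_dates=["Date"])

print(loaded.dtypes)
```

For the target file, the column to parse would be `SQLDATE` rather than `Date`, since that frame was renamed before saving.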