## Google Trends

In [2]:
from pytrends.request import TrendReq
from pytrends import dailydata
import pandas as pd
import time

### Creating the list of keywords

In [68]:
# List of keywords will be here - can create a loop for this as well
#SOURCE: https://stockanalysis.com/list/dow-jones-stocks/
djia_tickers = [
 'MSFT',
 'AAPL', 
 'Visa',  # Visa's ticker is just 'V', which is too ambiguous as a search term, so the company name is used instead
 'JPM',
 'UNH',
 'WMT',
 'JNJ',
 'PG',
 'HD',
 'MRK',
 'CVX',
 'CRM',
 'KO',
 'MCD', 
 'CSCO', 
 'INTC',
 'DIS',
 'VZ',
 'AMGN',
 'IBM',
 'CAT',
 'NKE',
 'AXP',
 'HON',
 'BA',
 'GS',
 'MMM', 
 'TRV',
 'DOW',
 'WBA'] 


In [69]:
djia_names = [
    "Microsoft Corporation",
    "Apple Inc.",
    "Visa Inc.",
    "JPMorgan Chase & Co.",
    "UnitedHealth Group Incorporated",
    "Walmart Inc.",
    "Johnson & Johnson",
    "The Procter & Gamble Company",
    "The Home Depot, Inc.",
    "Merck & Co., Inc.",
    "Chevron Corporation",
    "Salesforce, Inc.",
    "The Coca-Cola Company",
    "McDonald's Corporation",
    "Cisco Systems, Inc.",
    "The Walt Disney Company",
    "Intel Corporation",
    "International Business Machines Corporation",
    "Verizon Communications Inc.",
    "Caterpillar Inc.",
    "NIKE, Inc.",
    "Amgen Inc.",
    "American Express Company",
    "Honeywell International Inc.",
    "The Boeing Company",
    "The Goldman Sachs Group, Inc.",
    "3M Company",
    "The Travelers Companies, Inc.",
    "Dow Inc.",
    "Walgreens Boots Alliance, Inc."
]

In [70]:
len(djia_names)

30

### Creating a function to call Google Trends for a list of keywords

In [106]:
def google_trends(inputlist, start_year, list_name=None):
    """Fetch daily US Google Trends data for each keyword and save it to CSV.

    Writes one '<keyword>.csv' per keyword and one combined '<list_name>.csv'.
    """
    if list_name is None:
        # Fall back to the name of the global variable bound to inputlist
        matches = [name for name, value in globals().items() if value is inputlist]
        list_name = matches[0] if matches else "combined"

    frames = []  # One single-column DataFrame per keyword

    for string in inputlist:
        # Define the file name with .csv extension
        file_name = f"{string}.csv"

        retries = 0
        max_retries = 5
        while retries < max_retries:
            try:
                # Request daily data from January of start_year to December 2023
                data = dailydata.get_daily_data(string, start_year, 1, 2023, 12, geo='US').reset_index()

                # We only need the date and the scaled keyword column
                df = data[['date', string]].set_index('date')
                df.to_csv(file_name)
                #print(f"CSV file '{file_name}' has been created.")

                frames.append(df)
                break  # Exit the retry loop if successful
            except Exception as e:
                if "429" in str(e):
                    # Exponential backoff: wait 1, 2, 4, 8, 16 seconds
                    wait_time = 2**retries
                    print(f"Too many requests. Retrying in {wait_time} seconds...")
                    time.sleep(wait_time)
                    retries += 1
                else:
                    print(f"An error occurred: {e}")
                    break  # Exit the retry loop if it's not a 429 error
        else:
            # Runs only when the while loop ends without a break
            print(f"Max retries reached. Unable to fetch data for '{string}'.")

    # Join the per-keyword columns side by side, aligned on date
    combined_df = pd.concat(frames, axis=1) if frames else pd.DataFrame()
    combined_file_name = f"{list_name}.csv"
    combined_df.to_csv(combined_file_name, index=True)

    return f"Combined CSV file '{combined_file_name}' has been created."
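
The retry logic inside the function can also be pulled out into a small helper so that it can be tested without hitting Google Trends. The following is only a sketch: `retry_with_backoff`, `fetch`, and the injectable `sleep` argument are hypothetical names introduced here, and unlike the function above it re-raises non-429 errors instead of printing them.

```python
import time


def retry_with_backoff(fetch, max_retries=5, sleep=time.sleep):
    """Call fetch() until it succeeds, backing off on HTTP 429 errors.

    fetch: any zero-argument callable (a stand-in for the pytrends request).
    sleep: injectable so the helper can be exercised without really waiting.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception as exc:
            if "429" not in str(exc):
                raise  # Not a rate-limit error: let the caller see it
            wait_time = 2 ** attempt  # 1, 2, 4, 8, 16 seconds
            print(f"Too many requests. Retrying in {wait_time} seconds...")
            sleep(wait_time)
    raise RuntimeError("Max retries reached. Unable to fetch data.")
```

Under these assumptions, a request could be wrapped as `retry_with_backoff(lambda: dailydata.get_daily_data('AAPL', 2023, 1, 2023, 12, geo='US'))`.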


In [102]:
test = ['hello']
google_trends(test, 2023)

hello:2023-01-01 2023-01-31
CSV file 'hello.csv' has been created.
Combined CSV file 'test.csv' has been created.



'List completed'

In [103]:
test = pd.read_csv("test.csv")
test.head()

Unnamed: 0,date,hello
0,2023-01-01,73.96
1,2023-01-02,94.09
2,2023-01-03,73.96
3,2023-01-04,90.25
4,2023-01-05,88.36
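
With a single keyword the combined CSV looks the same whichever way the frames are combined, but with several keywords the `pd.concat` axis matters. The sketch below uses made-up values (not real Trends data) to show the difference between the default row-wise concatenation and `axis=1`.

```python
import pandas as pd

# Synthetic stand-ins for two per-keyword frames (values are made up)
dates = pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"])
df_a = pd.DataFrame({"AAPL": [3.7, 17.4, 19.2]}, index=dates)
df_b = pd.DataFrame({"MSFT": [5.0, 6.0, 7.0]}, index=dates)

# Default axis=0 appends rows: each date appears twice, with NaN gaps
stacked = pd.concat([df_a, df_b])

# axis=1 aligns on the date index: one row per date, one column per keyword
side_by_side = pd.concat([df_a, df_b], axis=1)

print(stacked.shape)
print(side_by_side.shape)
print(int(stacked.isna().sum().sum()), int(side_by_side.isna().sum().sum()))
```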


### Checking the CSV files from the function as pandas dataframes

In [110]:
#CHECKING IF CSV FILES DOWNLOADED PROPERLY 

df_AAPL = pd.read_csv("AAPL.csv", index_col=0)
df_AAPL

date,AAPL_unscaled,AAPL_monthly,isPartial,scale,AAPL
2018-01-01,12,31.0,False,0.31,3.72
2018-01-02,56,31.0,,0.31,17.36
2018-01-03,62,31.0,,0.31,19.22
2018-01-04,56,31.0,,0.31,17.36
2018-01-05,61,31.0,,0.31,18.91
...,...,...,...,...,...
2023-12-27,79,22.0,,0.22,17.38
2023-12-28,72,22.0,,0.22,15.84
2023-12-29,59,22.0,,0.22,12.98
2023-12-30,17,22.0,,0.22,3.74


Columns explained:

1. AAPL_unscaled: The original daily data, fetched month by month. Values are comparable within a month but not across different months.
2. AAPL_monthly: The original monthly data, fetched in a single request. Missing values have been filled so that no NaN values are present.
3. isPartial: Indicates whether the data for the corresponding date is complete or partial. "True" means the data for that date is incomplete and subject to revision; "False" means the data for that date is complete.
4. scale: The scale factor used to obtain the scaled daily data.
5. AAPL: The column named after the keyword. It contains the daily search volume, already scaled and comparable through time.

In general, for a keyword `word` the columns are named f'{word}_unscaled', f'{word}_monthly', 'isPartial', 'scale', and `word` itself.
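
In the rows displayed above, the values appear to satisfy scale = monthly / 100 and scaled = unscaled × scale. This relationship is inferred from the displayed numbers rather than taken from the pytrends documentation. A quick check on the first five rows, copied from the output:

```python
import pandas as pd

# First five rows copied from the AAPL output above
rows = pd.DataFrame(
    {
        "AAPL_unscaled": [12, 56, 62, 56, 61],
        "AAPL_monthly": [31.0, 31.0, 31.0, 31.0, 31.0],
        "scale": [0.31, 0.31, 0.31, 0.31, 0.31],
        "AAPL": [3.72, 17.36, 19.22, 17.36, 18.91],
    },
    index=pd.to_datetime(
        ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05"]
    ),
)

# Assumed relationships, inferred from the displayed values
recomputed_scale = rows["AAPL_monthly"] / 100
recomputed_daily = rows["AAPL_unscaled"] * recomputed_scale

scale_matches = bool((recomputed_scale - rows["scale"]).abs().max() < 1e-9)
daily_matches = bool((recomputed_daily - rows["AAPL"]).abs().max() < 1e-9)
print(scale_matches, daily_matches)
```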

## Documentation:

### How is Google Trends data normalized?

Google Trends normalizes search data to make comparisons between terms easier. Search results are normalized to the time and location of a query by the following process:

Each data point is divided by the total searches of the geography and time range it represents to compare relative popularity. Otherwise, places with the most search volume would always be ranked highest.

The resulting numbers are then scaled on a range of 0 to 100 based on a topic’s proportion to all searches on all topics.

Different regions that show the same search interest for a term don't always have the same total search volumes.
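
The two normalization steps can be illustrated with a small sketch. The search counts below are invented, and the sketch assumes the 0 to 100 scaling maps the largest share in the selected range to 100, which the text above does not spell out.

```python
import pandas as pd

months = ["Jan", "Feb", "Mar"]

# Hypothetical raw counts for one term and for all searches in the same region
term_searches = pd.Series([120, 300, 90], index=months)
total_searches = pd.Series([10_000, 15_000, 9_000], index=months)

# Step 1: divide by total searches to get relative popularity
share = term_searches / total_searches

# Step 2 (assumption): rescale so the largest share becomes 100
trend_index = (share / share.max() * 100).round().astype(int)

print(trend_index.to_dict())
```

Note that March has the fewest raw searches for the term, yet its index is not far below January's, because total search volume was also lower that month.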