# Google Trends API Data
## - Introduction:

#### Primary Objective:
The main goal of collecting Google Trends data for three distinct regions—Saudi Arabia, India, and the United Arab Emirates—is to analyze the interest in social media and mental health across these countries over specific time periods between 2020 and 2024. This analysis will help to understand how interest in social media evolves and its potential impact on mental health in these regions over time.

#### Secondary Objectives:

##### - Trend Analysis Over Time:
We aim to determine whether there is an increase or decrease in interest in social media and mental health in each region over the years. This will help identify any emerging patterns or changes in public concern related to these topics.

##### - Cross-Country Comparison:
This analysis will enable us to identify differences in interest between Saudi Arabia, the UAE, and India. For instance, do Arab countries exhibit similar trends regarding social media use and mental health compared to India? These comparisons can highlight regional variations.

##### - Drawing Conclusions About the Potential Impact of Social Media:
Based on the data, we will be able to form preliminary conclusions about the relationship between increased social media usage and the level of interest in mental health in each country. This may offer insights into how digital engagement correlates with mental health awareness or concern in different cultural contexts.

#### Methodology:

##### Keyword Selection:
We identified two groups of relevant keywords associated with the research topic. These keywords include popular social media platforms like Instagram, Twitter, and Facebook, along with terms such as "social media" and "mental health." This selection aims to measure public interest in these topics through Google Trends data.

##### Timeframe Definition:
The data has been segmented on an annual basis, from 2020 to 2024. This approach enables comparisons between different years and helps to observe trends that emerge or shift over time.

##### Geographic Regions:

- Saudi Arabia: <br> Chosen as our primary target audience, it represents a key region for understanding trends within the Arab world.
- United Arab Emirates:<br> The UAE was selected as it is the Arab nation with the highest usage of technology and social media, making it an important sample for studying trends in the region.
- India:<br> India is the second-largest user of social media globally, after China. However, China was excluded from this analysis due to unavailability of data in Google Trends, since different apps are predominantly used there. India's inclusion offers a comparative perspective on external trends.



## - Source of Dataset:
The dataset was sourced from Google Trends, accessible via the following link: [Google Trends](https://trends.google.com/trends/).

The data was retrieved programmatically using the 'pytrends' library, an unofficial Python interface for sending requests to Google Trends.

## - Steps:

#### 1. Installation
Before running the script to collect data from Google Trends, the necessary Python library, pytrends, had to be installed. This was done using the command-line interface (CMD) on the computer. To install the pytrends library, the following command was executed:

In [None]:
pip install pytrends

This command utilizes Python’s package manager, pip, to download and install the pytrends library. The CMD (Command Prompt) was used to ensure that the library was available in the environment, allowing the script to make requests to Google Trends and retrieve the required data.

Once the installation was successfully completed, the script was ready to run without any issues related to missing dependencies.

#### 2. Importing Necessary Libraries

In [None]:
from pytrends.request import TrendReq
import pandas as pd
import time

- TrendReq: This is used to make requests to the Google Trends API via the pytrends library.
- pandas: A powerful data manipulation library that allows for easy handling of tabular data (data frames).
- time: This module is used to introduce delays between API requests to avoid overloading the server.

#### 3. Setting up Google Trends Request

In [None]:
pytrends = TrendReq(hl='en-US', tz=360)

- TrendReq: Initializes the connection to Google Trends.
- hl='en-US': Specifies the language as English (US).
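- tz=360: Sets the timezone offset in minutes. pytrends follows Google's sign convention, in which 360 corresponds to US Central Standard Time (UTC-6:00), not GMT+6:00.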
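The sign of tz is easy to misread, so a small conversion can help. The sketch below assumes the convention described in the pytrends README (an offset in minutes, with US CST written as 360). The helper name `pytrends_tz_to_utc` is hypothetical and is not part of pytrends.

```python
from datetime import timedelta, timezone

def pytrends_tz_to_utc(tz_minutes):
    """Convert a pytrends-style tz value to a UTC offset.

    Assumption: the sign is inverted relative to the usual UTC offset,
    so a positive value means "behind UTC".
    """
    return timezone(timedelta(minutes=-tz_minutes))

print(pytrends_tz_to_utc(360))   # UTC-06:00 (US Central Standard Time)
print(pytrends_tz_to_utc(-180))  # UTC+03:00 (Arabia Standard Time)
```

Under this assumption, a request aligned to Saudi local time would use tz=-180. The script in this document keeps tz=360.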

#### 4. Defining Keyword Groups

In [None]:
keywords_group1 = ["social media", "mental health", "Instagram", "Twitter", "Facebook"]
keywords_group2 = ["Snapchat", "TikTok", "LinkedIn", "YouTube", "WhatsApp"]

Two groups of keywords are created, each containing terms related to social media and mental health. These keywords are used to gather data on search interest in each platform or topic over time.

The keywords were divided into two groups for two practical reasons:

- Google Trends Limitation: Google Trends accepts at most five search terms per comparison request. Splitting the ten keywords into two groups of five lets us collect data for the full set without exceeding that limit.

- Data Accuracy: Google Trends scales the values in a request relative to the most-searched term in that same request. Keeping each request small limits how much a dominant term flattens the less-searched ones. The trade-off is that the two groups are scaled independently, so values should be compared within a group rather than across groups.

This separation makes it possible to analyze all ten terms while staying within the platform's constraints.
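If more keywords are added later, the groups can be generated rather than typed by hand. The following is a minimal sketch, assuming a limit of five terms per request; `chunk_keywords` is a hypothetical helper, not a pytrends function.

```python
def chunk_keywords(keywords, max_per_request=5):
    """Split a keyword list into consecutive groups of at most max_per_request."""
    return [
        keywords[i:i + max_per_request]
        for i in range(0, len(keywords), max_per_request)
    ]

all_keywords = [
    "social media", "mental health", "Instagram", "Twitter", "Facebook",
    "Snapchat", "TikTok", "LinkedIn", "YouTube", "WhatsApp",
]

# Reproduces the two groups defined above: five terms each, in the same order.
keyword_groups = chunk_keywords(all_keywords)
for group in keyword_groups:
    print(group)
```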








#### 5. Specifying Time Ranges for Each Year

In [None]:
years = {
    '2020': '2020-01-01 2020-12-31',
    '2021': '2021-01-01 2021-12-31',
    '2022': '2022-01-01 2022-12-31',
    '2023': '2023-01-01 2023-12-31',
    '2024': '2024-01-01 2024-12-31'
}

The dictionary years specifies the time range for each year from 2020 to 2024. pytrends has no shorthand for a given calendar year: a custom timeframe is passed as a start date and an end date in the form 'YYYY-MM-DD YYYY-MM-DD'. Defining each year with exact start and end dates (January 1st to December 31st) therefore gives one request per calendar year.

Because each range covers the full year and the ranges do not overlap, the retrieved data spans each year without gaps or double counting.
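The same dictionary can also be built programmatically, which avoids typos when the range of years changes. This is only a sketch; `build_year_timeframes` is a hypothetical helper name.

```python
from datetime import date

def build_year_timeframes(first_year, last_year):
    """Map each year to a 'YYYY-MM-DD YYYY-MM-DD' string covering Jan 1 to Dec 31."""
    return {
        str(year): f"{date(year, 1, 1).isoformat()} {date(year, 12, 31).isoformat()}"
        for year in range(first_year, last_year + 1)
    }

years = build_year_timeframes(2020, 2024)
print(years["2020"])  # 2020-01-01 2020-12-31
```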

#### 6. Setting Target Countries

In [None]:
countries = ['SA', 'IN', 'AE']

A list of country codes (SA for Saudi Arabia, IN for India, AE for the United Arab Emirates) is defined. Data will be collected for these countries.

#### 7. Function to Fetch Data for Each Country

In [None]:
def fetch_data_for_country(country, keywords_group1, keywords_group2):
    country_data = pd.DataFrame()  
    for year, timeframe in years.items():
        for keywords in [keywords_group1, keywords_group2]:
            try:
                pytrends.build_payload(keywords, cat=0, timeframe=timeframe, geo=country, gprop='')
                data = pytrends.interest_over_time()
                if not data.empty:
                    data['Year'] = year  
                    data['Country'] = country  
                    country_data = pd.concat([country_data, data], axis=0)
                else:
                    print(f"No data available for {year} with keywords {keywords} in {country}")
            except Exception as e:
                print(f"Error fetching data for {year} with keywords {keywords} in {country}: {e}")
            time.sleep(30)
    return country_data

- Parameters:
  - country:<br>
The country code for which data is being fetched.
  - keywords_group1 & keywords_group2:<br>
The two groups of keywords to be used in the Google Trends queries.
- Process:<br>
For each year and each keyword group, the function calls build_payload() to prepare a Google Trends request for the given country. It reads years and pytrends from the enclosing (global) scope.
The data is then retrieved with interest_over_time(), which returns a DataFrame indexed by date with one column per keyword (plus an isPartial flag).
If the result is not empty, the year and country are added as columns and the rows are appended to country_data.
Any exception is caught and printed, so a failed request is skipped rather than stopping the run. The function waits 30 seconds after every request (time.sleep(30)), whether it succeeded or not, to avoid overloading the API.
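Because a failed request is only logged, that year and keyword group would be missing from the final dataset. One option is to retry before giving up. The sketch below is not part of the original script: `call_with_retries` and `flaky_request` are hypothetical names, and the doubling delay is an assumed policy rather than anything Google documents.

```python
import time

def call_with_retries(fetch, max_attempts=3, base_delay=30, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as error:  # broad on purpose, like the function above
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay}s")
            sleep(delay)

# Demonstration with a stand-in for a request that fails twice, then succeeds.
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

waits = []  # record the delays instead of actually sleeping
result = call_with_retries(flaky_request, sleep=waits.append)
print(result, waits)
```

In the fetch function, the build_payload() and interest_over_time() calls could be wrapped in a small function and passed as `fetch`.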

#### 8. Loop to Fetch Data for All Countries

In [None]:
all_data = pd.DataFrame()  

for country in countries:
    print(f"Fetching data for {country}...")
    country_data = fetch_data_for_country(country, keywords_group1, keywords_group2)
    all_data = pd.concat([all_data, country_data], axis=0)

The loop goes through each country in the countries list, fetching data for each one and concatenating it to a single DataFrame all_data.

#### 9. Resetting the Index

In [None]:
all_data.reset_index(inplace=True)

Once all the data is collected, the index of the DataFrame is reset. The date index returned by pytrends becomes a regular date column, and the rows receive a clean, continuous integer index.


#### 10. Summing Data by Year and Country

In [None]:
if 'Year' in all_data.columns:
    yearly_data = all_data.groupby(['Year', 'Country'])[keywords_group1 + keywords_group2].sum().reset_index()

This checks if the data includes the Year column and, if so, groups the data by year and country, summing up the interest values for all the keywords.
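Because the two keyword groups are stacked as separate rows, each row holds values for one group and missing values for the other. The sketch below uses small made-up frames (not real Google Trends values) to show how the grouped sum handles this: missing values are skipped, and the result has one row per year and country.

```python
import pandas as pd

# Synthetic weekly rows standing in for two pytrends responses (values invented).
dates = pd.to_datetime(["2020-01-05", "2020-01-12"])
group1 = pd.DataFrame({"social media": [10, 20], "mental health": [5, 15]}, index=dates)
group2 = pd.DataFrame({"TikTok": [30, 40]}, index=dates)

for frame in (group1, group2):
    frame["Year"] = "2020"
    frame["Country"] = "SA"

# Stacking leaves NaN where a row's group did not include a keyword.
stacked = pd.concat([group1, group2], axis=0)
print(stacked["TikTok"].isna().sum())  # 2 rows from group1 have no TikTok value

keyword_cols = ["social media", "mental health", "TikTok"]
yearly = stacked.groupby(["Year", "Country"])[keyword_cols].sum().reset_index()
print(yearly)
```

The summed values are totals of relative interest scores, so they are best read as a within-keyword indicator across years rather than as search counts.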

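#### 11. Selecting the Final Columns

In [None]:
# 'date' is not selected: the groupby in step 10 collapses the per-date rows,
# so yearly_data no longer has a date column and selecting it would raise a KeyError.
selected_columns = ['Year', 'Country'] + keywords_group1 + keywords_group2
yearly_data = yearly_data[selected_columns]

This step ensures that the final DataFrame contains only the necessary columns: Year, Country, and the keywords. It also fixes the column order.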

#### 12. Saving Data to CSV

In [None]:
yearly_data.to_csv("GoogleTrends_Data.csv", index=False)

Finally, the aggregated data is saved to a CSV file called GoogleTrends_Data.csv. Passing index=False leaves the DataFrame's row index out of the file.

#### 13. Handling Empty Data

In [None]:
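if 'Year' not in all_data.columns:
    print("No data was collected.")

If no data was collected (i.e., the Year column does not exist), a message is printed indicating that nothing was fetched. In the full script this is the else branch of the if statement from step 10, with steps 10 to 12 forming the body of that if. It is written here as a standalone check so that the cell runs on its own.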

## - Operations and Decisions:
#### - Collection Methods:
We employed the pytrends library to systematically retrieve data from Google Trends. This method enabled us to capture public interest trends related to social media and mental health across multiple countries. Specifically, we used keyword searches within predefined timeframes for Saudi Arabia, India, and the UAE, which provided us with region-specific data.

#### - Processing and Cleaning Tasks:
During the data collection process, we implemented several cleaning and verification steps to ensure the integrity and completeness of our dataset.<br>
1- Data Availability Check:<br>After each data retrieval attempt, we confirmed that the returned dataset was not empty. If no data was available for a specific year or keyword group, we logged a message for transparency in our collection process.

2- Adding Relevant Columns:<br> We added two new columns—"Year" and "Country"—to each retrieved dataset. This organization enabled easier trend analysis based on these dimensions.

3- Data Concatenation:<br> We combined the retrieved data for each country into a single DataFrame using the pd.concat() function. This merged results from multiple requests while maintaining dataset integrity.

4- Resetting Index:<br> After concatenation, we reset the index of the final DataFrame with reset_index(inplace=True) to ensure a clean, continuous index for easier manipulation.

5- Grouping Data:<br>Once we had the complete dataset, we grouped it by "Year" and "Country" to aggregate the keyword interest values using the groupby() and sum() functions, providing a summarized view of interest across keywords for each year and country.

6- Column Selection:<br> After aggregation, we retained the "Year," "Country," and keyword columns, streamlining the dataset for easier analysis and visualization. The per-row "date" column is not kept, because the yearly grouping collapses it.

7- Handling Missing Data:<br>We monitored the dataset for any missing or NaN values. Identifying these gaps was crucial for ensuring the accuracy of our findings, although the specifics of addressing them will depend on the final analysis.

By following these steps, we ensured that our dataset was comprehensive, organized, and ready for further analysis of social media and mental health trends across the selected countries.

#### - Decisions Made:

- Keyword Grouping:<br> Because Google Trends accepts at most five search terms per request, we divided the ten keywords into two groups of five. This kept every request within the limit and avoided errors.
- Yearly Timeframes:<br> Since a custom Google Trends timeframe is specified as a start date and an end date, we manually defined the yearly ranges (e.g., '2020-01-01 2020-12-31') to capture trends over entire years. This allowed us to compare year-on-year trends for each keyword in each country.
- Separate Data Processing for Each Country:<br> To maintain accuracy and manage API limitations, we processed the data for each country individually. This approach helped ensure that each country’s trends were collected correctly and allowed us to focus on the unique patterns in each region.
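One consequence of the keyword grouping decision deserves a small illustration. Google Trends reports interest relative to the highest point in a request, not absolute search counts. The sketch below is a deliberately simplified model of that scaling, with invented raw volumes and a hypothetical helper `scale_to_peak`; the real service also samples and normalizes by total search volume.

```python
def scale_to_peak(raw_counts):
    """Rescale every series so the largest value in the whole request is 100."""
    peak = max(max(series) for series in raw_counts.values())
    return {
        term: [round(100 * value / peak) for value in series]
        for term, series in raw_counts.items()
    }

# Hypothetical raw search volumes, invented purely for illustration.
group_a = {"Facebook": [800, 1000], "mental health": [40, 50]}
group_b = {"LinkedIn": [80, 100], "Snapchat": [60, 90]}

scaled_a = scale_to_peak(group_a)
scaled_b = scale_to_peak(group_b)
print(scaled_a)  # {'Facebook': [80, 100], 'mental health': [4, 5]}
print(scaled_b)  # {'LinkedIn': [80, 100], 'Snapchat': [60, 90]}
```

In this toy example both Facebook and LinkedIn peak at 100 even though their raw volumes differ tenfold, which is why values are safer to compare within one keyword group than across the two groups.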

## - Challenges
- Accessing Google Responses:<br>
A significant challenge we encountered was managing multiple requests to the Google Trends API without exceeding the rate limits. To avoid server overload and ensure consistent data retrieval, we implemented a time.sleep(30) command to introduce a delay between each request. This step was crucial for maintaining stable API communication and avoiding request failures due to rate limits.

- Timeframe Specification:<br>
pytrends has no option for requesting a calendar year directly; a custom timeframe must be given as a start date and an end date. To overcome this, we manually defined exact date ranges for each year (January 1st to December 31st). This allowed us to gather data for entire years, structured according to our yearly analysis needs. The data points returned within each year (typically weekly for a one-year range) were then aggregated into yearly summaries to align with our research objectives.

- Keyword Grouping Limitations:<br>
Google Trends accepts at most five search terms per request. This limitation led us to divide the keywords into two groups of five, each containing relevant social media and mental health terms. Although this increased the complexity of the data retrieval process, we combined the results during post-processing to ensure all keywords were included in the analysis. This required careful data management and merging.

- Single Country Requests:<br>
The Google Trends API only supports data extraction for one country per request. To manage this, we developed a function that iterated through the countries (Saudi Arabia, India, and the UAE) and collected data separately for each. This significantly increased the time required for data collection but ensured that each country's trends were handled independently, maintaining the integrity and focus of the research.
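The cost of these constraints can be estimated with simple arithmetic. The sketch below uses the values from this document (three countries, five years, two keyword groups, a 30-second delay) and counts only the waiting time, not the time each request itself takes.

```python
countries = ['SA', 'IN', 'AE']
years = ['2020', '2021', '2022', '2023', '2024']
keyword_group_count = 2
delay_seconds = 30

# One request per (country, year, keyword group) combination.
total_requests = len(countries) * len(years) * keyword_group_count
minimum_wait_minutes = total_requests * delay_seconds / 60

print(f"{total_requests} requests, at least {minimum_wait_minutes:.0f} minutes of waiting")
# 30 requests, at least 15 minutes of waiting
```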

## - A Bias and Fairness Report:

#### - Summary of Research on Data Bias and Fairness:
Our analysis centered on a specific set of keywords that, while relevant, may not encompass all aspects of social media's impact on mental health. This focus could limit the broader applicability of our findings. Additionally, the underrepresentation of certain countries in the data may constrain the analysis. This means that some regions or populations might be overlooked, potentially leading to skewed insights that do not fully reflect global trends or the diverse ways social media impacts mental health across different contexts.

#### - Evaluation of Dataset's Potential Biases:
We identified several potential biases in our dataset. The limited range of keywords might not fully represent all trends related to social media and mental health, which can skew our conclusions. Additionally, there is a risk of underrepresentation of certain countries, as differences in internet access, search behavior, and cultural factors can lead to uneven data availability across regions. These biases may further limit the accuracy and fairness of our analysis.

#### - Implications of Biases:
The biases in our data collection could significantly affect the credibility and reliability of our research conclusions. If certain trends or demographics are underrepresented, it could lead to misinterpretations of social media's impact on mental health.

#### - Recommendations for Mitigating Biases:
To enhance the robustness of our research, we recommend expanding the range of keywords used to encompass a wider array of social media platforms and mental health terms. Additionally, collecting data in multiple languages and increasing the scope of the study to include more diverse populations will help provide a more comprehensive understanding of the subject matter.