# Google Trends API Data
## - Introduction:

#### Primary Objective:
The main goal of collecting Google Trends data for three distinct regions—Saudi Arabia, India, and the United Arab Emirates—is to analyze the interest in social media and mental health across these countries over specific time periods between 2020 and 2024. This analysis will help to understand how interest in social media evolves and its potential impact on mental health in these regions over time.

#### Secondary Objectives:

##### - Trend Analysis Over Time:
We aim to determine whether there is an increase or decrease in interest in social media and mental health in each region over the years. This will help identify any emerging patterns or changes in public concern related to these topics.

##### - Cross-Country Comparison:
This analysis will enable us to identify differences in interest between Saudi Arabia, the UAE, and India. For instance, do Arab countries exhibit similar trends regarding social media use and mental health compared to India? These comparisons can highlight regional variations.

##### - Drawing Conclusions About the Potential Impact of Social Media:
Based on the data, we will be able to form preliminary conclusions about the relationship between search interest in social media and search interest in mental health in each country. Because Google Trends measures search interest rather than actual usage, these conclusions are correlational: they may offer insights into how digital engagement relates to mental health awareness or concern in different cultural contexts, but they do not establish causation.

#### Methodology:

##### Keyword Selection:
We identified two groups of relevant keywords associated with the research topic. These keywords include popular social media platforms like Instagram, Twitter, and Facebook, along with terms such as "social media" and "mental health." This selection aims to measure public interest in these topics through Google Trends data.

##### Timeframe Definition:
The data has been segmented on an annual basis, from 2020 to 2024. This approach enables comparisons between different years and helps to observe trends that emerge or shift over time.

##### Geographic Regions:

- Saudi Arabia: <br> Chosen as our primary target audience, it represents a key region for understanding trends within the Arab world.
- United Arab Emirates:<br> The UAE was selected because it is among the Arab nations with the highest usage of technology and social media, making it an important sample for studying trends in the region.
- India:<br> India is the second-largest user of social media globally, after China. However, China was excluded from this analysis due to unavailability of data in Google Trends, since different apps are predominantly used there. India's inclusion offers a comparative perspective on external trends.



## - Source of Dataset:
The dataset was sourced from Google Trends, accessible via the following link: [Google Trends](https://trends.google.com/trends/).

The data was retrieved programmatically using the 'pytrends' library, an unofficial Python interface to Google Trends.

## - Attributes’ description table:

| Column Name        | Description                                                             | Data Type     | Possible Values                      |
|:-------------------|:------------------------------------------------------------------------|:-------------:|:-------------------------------------:|
| `date`             | Period label for the data entry; in the exported file it is a copy of `Year`. | Object        | Year labels (e.g., '2020') |
| `Year`             | Indicates the year of the data entry.                                   | Numeric       | Integer values (e.g., 2020, 2021)   |
| `Country`          | Represents the country for which the data is collected.                 | Categorical   | Country codes (e.g., 'AE', 'IN', 'SA') |
| `social media`     | Indicates the overall interest in social media for the year and country.| Numeric       | Continuous numeric values             |
| `mental health`    | Indicates the overall interest in mental health for the year and country.| Numeric       | Continuous numeric values             |
| `Instagram`        | Represents the interest in Instagram for the year and country.          | Numeric       | Continuous numeric values             |
| `Twitter`          | Represents the interest in Twitter for the year and country.            | Numeric       | Continuous numeric values             |
| `Facebook`         | Represents the interest in Facebook for the year and country.           | Numeric       | Continuous numeric values             |
| `Snapchat`         | Represents the interest in Snapchat for the year and country.           | Numeric       | Continuous numeric values             |
| `TikTok`           | Represents the interest in TikTok for the year and country.             | Numeric       | Continuous numeric values             |
| `LinkedIn`         | Represents the interest in LinkedIn for the year and country.           | Numeric       | Continuous numeric values             |
| `YouTube`          | Represents the interest in YouTube for the year and country.            | Numeric       | Continuous numeric values             |
| `WhatsApp`         | Represents the interest in WhatsApp for the year and country.           | Numeric       | Continuous numeric values             |


## - Steps:

#### 1. Installation
Before running the script to collect data from Google Trends, the necessary Python library, pytrends, had to be installed. This was done using the command-line interface (CMD) on the computer. To install the pytrends library, the following command was executed:

In [None]:
pip install pytrends

This command utilizes Python’s package manager, pip, to download and install the pytrends library. The CMD (Command Prompt) was used to ensure that the library was available in the environment, allowing the script to make requests to Google Trends and retrieve the required data.

Once the installation was successfully completed, the script was ready to run without any issues related to missing dependencies.

#### 2. Importing Necessary Libraries

In [10]:
from pytrends.request import TrendReq
import pandas as pd
import time
import random
from pytrends.exceptions import ResponseError

In this code, we imported the necessary libraries and tools for working with Google Trends and data. Here's a breakdown of what we did:

1. from pytrends.request import TrendReq:
We imported TrendReq from pytrends, allowing us to access Google Trends data. This is the main object used to send search queries to Google Trends.
2. import pandas as pd:
We imported Pandas, a popular data analysis library, to organize and manipulate the data using DataFrames for easy analysis.
3. import time:
We imported time to manage time-based functions like delaying execution between requests to avoid overwhelming Google with too many queries at once.
4. import random:
We imported random to introduce random delays between requests, which spaces the requests out less predictably and reduces the chance of triggering rate limits.
5. from pytrends.exceptions import ResponseError:
We imported ResponseError to handle potential errors from Google Trends, allowing us to retry requests or manage failures gracefully.

#### 3. Setting up Google Trends Request

In [11]:
pytrends = TrendReq(hl='en-US', tz=360, timeout=(10, 25))

We initialize a TrendReq object from the pytrends library with the following parameters:

- hl='en-US': Sets the language for the Google Trends results to English (United States). The hl parameter stands for "host language."
- tz=360: Sets the time zone offset in minutes. pytrends follows Google's convention, in which the value counts minutes behind UTC, so 0 is UTC and 360 corresponds to US Central Standard Time (UTC-6), not UTC+6. This is the example value from the pytrends documentation; it is not the local time zone of any of the three countries studied.
- timeout=(10, 25): Sets the request timeouts as a pair of values: 10 seconds for the connection to be established and 25 seconds for reading the data once connected. These timeouts help ensure that the code doesn't hang indefinitely if Google Trends responds slowly.


By creating this pytrends object, we are now ready to send search queries to Google Trends with the specified language, time zone, and connection settings.
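The sign convention of tz is easy to get backwards, so a small helper can make the intent explicit. The sketch below assumes the convention described in the pytrends README (minutes behind UTC, so US CST is 360). The function `tz_for_utc_offset` is a hypothetical helper written for this illustration and is not part of pytrends.

```python
def tz_for_utc_offset(hours_east_of_utc):
    """Return a pytrends-style tz value for a UTC offset given in hours.

    Hypothetical helper. Assumes tz = -(UTC offset in minutes), i.e. the
    value counts minutes *behind* UTC.
    """
    return int(round(-hours_east_of_utc * 60))


print(tz_for_utc_offset(-6))   # US Central Standard Time (UTC-6)
print(tz_for_utc_offset(3))    # Saudi Arabia (UTC+3)
print(tz_for_utc_offset(5.5))  # India (UTC+5:30)
```

Under this assumption, a positive UTC offset such as Saudi Arabia's maps to a negative tz value.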

#### 4. Defining Keyword Groups

In [12]:
keywords_group1 = ["social media", "mental health", "Instagram", "Twitter", "Facebook"]
keywords_group2 = ["Snapchat", "TikTok", "LinkedIn", "YouTube", "WhatsApp"]

Two groups of keywords are created, each containing five terms related to social media and mental health. These keywords are used to gather data on search interest in each platform or topic over time.
<br>
The keywords were divided into two groups for the following reasons:
<br>
- Google Trends Limitation: Google Trends accepts at most five search terms per comparison request. Splitting the ten keywords into two groups of five lets us collect data for all of them without exceeding this limit.

- Data Scaling: Google Trends scores each request on a relative 0 to 100 scale, anchored to the most-searched term in that request. Values are therefore directly comparable within a group, but values from different groups are not on a common scale and should be compared with caution.

This separation lets us cover both the social media platforms and the mental health topics while staying within the request limit.
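The same split can be produced programmatically, which helps if the keyword list grows. The following is a minimal sketch, assuming a limit of five terms per request; `chunk_keywords` is a hypothetical helper name and not part of pytrends.

```python
def chunk_keywords(keywords, size=5):
    """Split a keyword list into consecutive groups of at most `size` terms."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [keywords[i:i + size] for i in range(0, len(keywords), size)]


all_keywords = [
    "social media", "mental health", "Instagram", "Twitter", "Facebook",
    "Snapchat", "TikTok", "LinkedIn", "YouTube", "WhatsApp",
]

groups = chunk_keywords(all_keywords)
for group in groups:
    print(group)
```

With the ten keywords used here, this yields the same two groups of five that were defined by hand.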







#### 5. Specifying Time Ranges for Each Year

In [13]:
years = {
    '2020': '2020-01-01 2020-12-31',
    '2021': '2021-01-01 2021-12-31',
    '2022': '2022-01-01 2022-12-31',
    '2023': '2023-01-01 2023-12-31',
    '2024': '2024-01-01 2024-12-31'
}

The dictionary years maps each year from 2020 to 2024 to a timeframe string. pytrends takes a custom timeframe as an explicit date range in the form 'YYYY-MM-DD YYYY-MM-DD'; its relative shorthands (such as 'today 5-y') do not select a single calendar year. Defining each year from January 1st to December 31st therefore gives one request per calendar year, with no gaps or overlaps between years.

Google Trends returns several data points within each range rather than a single yearly value, so the yearly figures are produced later by aggregating these points.
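The year-to-timeframe mapping can also be generated instead of typed out, which avoids copy-and-paste mistakes in the dates. A minimal sketch follows; `build_year_timeframes` is a hypothetical helper name introduced for illustration.

```python
def build_year_timeframes(start_year, end_year):
    """Map each year (as a string) to a 'YYYY-01-01 YYYY-12-31' timeframe."""
    return {
        str(year): f"{year}-01-01 {year}-12-31"
        for year in range(start_year, end_year + 1)
    }


years = build_year_timeframes(2020, 2024)
for label, timeframe in years.items():
    print(label, "->", timeframe)
```

The result has the same keys and values as the hand-written dictionary.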

#### 6. Setting Target Countries

In [14]:
countries = ['SA', 'IN', 'AE']

A list of country codes (SA for Saudi Arabia, IN for India, AE for the United Arab Emirates) is defined. Data will be collected for these countries.

#### 7. Function to Fetch Google Trends Data 

In [15]:
def fetch_data_with_retry(country, keywords, year, timeframe, retries=3):
    # Try the request up to `retries` times; `year` is kept for labeling only
    for attempt in range(retries):
        try:
            pytrends.build_payload(keywords, cat=0, timeframe=timeframe, geo=country, gprop='')
            data = pytrends.interest_over_time()
            return data
        except ResponseError as e:
            if "429" in str(e):
                # Rate limited: wait 60 s, then 120 s, then 180 s before the next attempt
                print(f"Rate limit exceeded for {country}, retrying... ({attempt + 1}/{retries})")
                time.sleep(60 * (attempt + 1))
            else:
                print(f"Other error occurred: {e}")
                break
        except Exception as e:
            print(f"Error: {e}")
            break
    # All attempts failed or a non-retryable error occurred
    return pd.DataFrame()

The fetch_data_with_retry function fetches Google Trends data and retries when a request fails because the rate limit was exceeded. Here's a breakdown of the function:

- Parameters:
  - country: The country code (e.g., 'US', 'SA') for which we want to fetch data.
  - keywords: A list of keywords to search for (e.g., ['social media', 'mental health']).
  - year: The year label for the request. It is not used inside the function body, because the timeframe already carries the dates.
  - timeframe: The timeframe for the query (e.g., '2020-01-01 2020-12-31').
  - retries: The maximum number of attempts before giving up (default is 3).

- What the function does:
  - for attempt in range(retries): This loop allows us to try fetching the data up to retries times if something goes wrong.
  - pytrends.build_payload: Sets up the search query for the given keywords, country, and timeframe.
  - pytrends.interest_over_time(): After building the payload, we fetch the interest-over-time data. If successful, the data is returned immediately.
  - except ResponseError: If Google returns an error response, the handler checks whether the message contains the 429 status code (too many requests). If it does, we print a message and wait before the next attempt; the delay grows with each attempt (60, 120, then 180 seconds). Any other ResponseError is printed and the loop stops.
  - except Exception: Any other error is caught here and printed, and the loop stops.
  - return pd.DataFrame(): If all attempts fail or a non-retryable error occurs, the function returns an empty DataFrame.

- Purpose: The function handles Google Trends rate limiting (error 429) by retrying with increasing wait times, while failing gracefully on other errors. This makes the data-fetching process more reliable when many requests are sent in sequence.
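The retry pattern itself can be separated from the Google Trends call, which makes it possible to test without network access. The sketch below is a generic illustration, not the function used in this project: `retry_with_backoff` and `flaky_request` are hypothetical names, and unlike fetch_data_with_retry this version re-raises the last error instead of returning an empty DataFrame.

```python
import time


def retry_with_backoff(func, retries=3, base_delay=60, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * attempt seconds and retry.

    Re-raises the last exception once all attempts are used up.
    """
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise
            sleep(base_delay * attempt)


# Stand-in request that fails twice and then succeeds.
calls = {"count": 0}
waits = []


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated 429 response")
    return "ok"


# Record the requested delays instead of actually sleeping.
result = retry_with_backoff(flaky_request, sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter is a design choice that lets the delays be inspected rather than waited for.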

#### 8. Function to Collect Data for Each Country

In [16]:

def fetch_data_for_country(country, keywords_group1, keywords_group2):
    country_data = pd.DataFrame()
    for year, timeframe in years.items():
        for keywords in [keywords_group1, keywords_group2]:
            print(f"Fetching data for {year} with keywords {keywords} in {country}...")
            data = fetch_data_with_retry(country, keywords, year, timeframe)
            if not data.empty:
                data['Year'] = year
                data['Country'] = country
                country_data = pd.concat([country_data, data], axis=0)
            else:
                print(f"No data available for {year} with keywords {keywords} in {country}")
            time.sleep(random.uniform(5, 15))  
    return country_data


all_data = pd.DataFrame()

- Function: fetch_data_for_country. This function retrieves Google Trends data for a given country using two groups of keywords. The data is fetched year by year and concatenated into a single DataFrame.

- Parameters:
    - country: The country code (e.g., 'US', 'SA') for which data is being fetched.
    - keywords_group1: The first set of keywords to search for.
    - keywords_group2: The second set of keywords to search for.

- Steps:
    - Create an empty DataFrame (country_data): This will store all the data for the country being processed.
    - Loop over years: The years variable maps each year to its timeframe. For each year, the function fetches data using both keywords_group1 and keywords_group2.
    - Call the fetch_data_with_retry function: It is called once per keyword group, with the country, keywords, year, and timeframe. It handles retries and rate-limit issues when making requests to Google Trends.
    - Check if data was fetched: If the returned data is not empty, two columns are added: 'Year' (the year for which the data was fetched) and 'Country' (the country for which the data was fetched). The new data is then concatenated to country_data. Because the two keyword groups have different columns, stacking them row-wise leaves NaN in the columns that belong to the other group; these are resolved later when the values are summed per year and country. If the returned data is empty, a message is printed instead.
    - Delay with random sleep: To avoid being blocked by Google Trends for making too many requests in a short time, a random delay (between 5 and 15 seconds) is introduced between requests.
    - Return country_data: After both keyword groups have been fetched for all years, the data for the country is returned.
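To make the effect of row-wise stacking concrete, the sketch below compares it with joining the two keyword groups side by side on their shared date index. The numbers are toy values for illustration only, not real Google Trends output, and the side-by-side variant is an alternative layout rather than what the collection code does.

```python
import pandas as pd

dates = pd.to_datetime(["2020-01-05", "2020-01-12"])

# Toy values for illustration only; not real Google Trends output.
group1 = pd.DataFrame({"Instagram": [80, 90], "Twitter": [40, 45]}, index=dates)
group2 = pd.DataFrame({"TikTok": [10, 20], "YouTube": [95, 100]}, index=dates)

# Row-wise stacking (as in the collection code): one block of rows per group,
# with NaN in the columns the group does not have.
stacked = pd.concat([group1, group2], axis=0)

# Column-wise joining: one row per date, no NaN introduced.
side_by_side = pd.concat([group1, group2], axis=1)

print(stacked.shape, int(stacked.isna().sum().sum()))
print(side_by_side.shape, int(side_by_side.isna().sum().sum()))
```

Either layout gives the same yearly sums, since summation skips NaN, but the side-by-side layout keeps the un-aggregated table free of artificial gaps.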

#### 9. Loop to Fetch and Concatenate Data for Each Country

In [None]:
for country in countries:
    print(f"Fetching data for {country}...")
    country_data = fetch_data_for_country(country, keywords_group1, keywords_group2)
    all_data = pd.concat([all_data, country_data], axis=0)

all_data.reset_index(inplace=True, drop=True)

- Fetching Data for Multiple Countries: We iterate over the list of countries (countries) to retrieve Google Trends data for each one. For each country, we call the fetch_data_for_country function to get data using the two sets of keywords. The data for each country is then concatenated into a unified DataFrame (all_data).
- Resetting the Index: After merging all the data, we reset the index in all_data so that it is sequential, making it easier to work with.
- Result: The result is a DataFrame (all_data) that contains Google Trends data for all countries, with a clean index ready for analysis.

#### 10. Grouping Data and Exporting to CSV

In [18]:
if 'Year' in all_data.columns:
    # Sum the keyword scores for each year and country
    yearly_data = all_data.groupby(['Year', 'Country'])[keywords_group1 + keywords_group2].sum().reset_index()

    # Add a 'date' column that mirrors the year
    yearly_data['date'] = yearly_data['Year']

    # Keep the identifying columns first, followed by all keyword columns
    selected_columns = ['date', 'Year', 'Country'] + keywords_group1 + keywords_group2
    yearly_data = yearly_data[selected_columns]

    print(yearly_data.head())

    # Export without the DataFrame index
    yearly_data.to_csv("APIGoogleTrends_data.csv", index=False)
else:
    print("No data was collected.")

   date  Year Country  social media  mental health  Instagram  Twitter  \
0  2020  2020      AE         148.0           56.0     2495.0   2352.0   
1  2020  2020      IN          57.0            5.0     3856.0    845.0   
2  2020  2020      SA          92.0            2.0     1574.0   2404.0   
3  2021  2021      AE         174.0           67.0     3129.0   2289.0   
4  2021  2021      IN          57.0            2.0     4255.0    735.0   

   Facebook  Snapchat  TikTok  LinkedIn  YouTube  WhatsApp  
0    4442.0     106.0   274.0     422.0   4642.0    2810.0  
1    4228.0      54.0   190.0     162.0   4500.0    2761.0  
2    3904.0     216.0   157.0     208.0   4302.0    2160.0  
3    4154.0     113.0   448.0     470.0   4448.0    3563.0  
4    3003.0      74.0    67.0     184.0   4143.0    4143.0  


1. Check for 'Year' Column:

The code first checks if the 'Year' column exists in the all_data DataFrame to ensure that there is data to process.

2. Group Data by Year and Country:

If the column exists, it groups the data by 'Year' and 'Country', summing the values of the specified keywords from keywords_group1 and keywords_group2.
The result is stored in yearly_data, and the index is reset for easier manipulation.

3. Add a 'date' Column:

A new column named 'date' is added to yearly_data, which is set to the values of the 'Year' column.

4. Select Required Columns for Export:

It defines which columns to keep in the final DataFrame for export, including 'date', 'Year', 'Country', and the keywords from both groups.

5. Display the Aggregated Data:

The first few rows of the aggregated data (yearly_data) are printed to the console for verification.

6. Save the Data to a CSV File:

Finally, the aggregated data is saved to a CSV file named "APIGoogleTrends_data.csv" without including the index.

7. Handle the Empty Case:

If no data was collected (i.e., if the 'Year' column is not present), it prints a message indicating that no data was collected.
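One detail worth checking after export is the type of the Year column. In the collection code the Year values come from the string keys of the years dictionary, whereas pandas infers integers when the CSV is read back. The sketch below illustrates this with toy values and an in-memory buffer in place of the real file; it is an illustration under those assumptions, not part of the original pipeline.

```python
import io

import pandas as pd

# Toy values for illustration only.
yearly = pd.DataFrame({
    "date": ["2020", "2021"],
    "Year": ["2020", "2021"],
    "Country": ["SA", "SA"],
    "Instagram": [10.0, 20.0],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
yearly.to_csv(buffer, index=False)

# Default read: year labels are inferred as integers.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Explicit dtype: year labels stay as strings.
buffer.seek(0)
reloaded_as_text = pd.read_csv(buffer, dtype={"date": str, "Year": str})

print(reloaded["Year"].dtype)
print(reloaded_as_text["Year"].tolist())
```

Passing dtype to read_csv is one way to keep the labels consistent between the in-memory table and the reloaded file.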

#### 11. Checking for Missing Values

In [20]:
print(yearly_data.isnull().sum())


date             0
Year             0
Country          0
social media     0
mental health    0
Instagram        0
Twitter          0
Facebook         0
Snapchat         0
TikTok           0
LinkedIn         0
YouTube          0
WhatsApp         0
dtype: int64


The results show no missing values in any column of yearly_data, including 'social media', 'mental health', and the platform columns (Instagram, Twitter, Facebook, etc.). This check should be read with care, however. The two keyword groups were stacked row-wise, so the un-aggregated all_data contains NaN in the columns that belong to the other group, and groupby().sum() skips NaN and returns 0 for a group with no valid values. A zero count here therefore shows that the aggregated table is fully populated, but it does not by itself prove that every request succeeded; the "No data available" messages printed during collection remain the direct evidence of failed requests. The retry logic and error handling reduce the likelihood of such failures.
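The behavior of summation with missing values can be checked on a small example. The sketch below uses toy values (not project data) to show that a group containing only NaN sums to 0 by default, and that passing min_count=1 to sum keeps such a group as NaN so that it stays visible to isnull().

```python
import numpy as np
import pandas as pd

# Toy values for illustration only: 2021 has no valid TikTok values at all.
frame = pd.DataFrame({
    "Year": ["2020", "2020", "2021", "2021"],
    "Country": ["SA", "SA", "SA", "SA"],
    "TikTok": [10.0, np.nan, np.nan, np.nan],
})

grouped = frame.groupby(["Year", "Country"])["TikTok"]

default_sum = grouped.sum()             # all-NaN group becomes 0.0
strict_sum = grouped.sum(min_count=1)   # all-NaN group stays NaN

print(default_sum)
print(strict_sum)
```

Using min_count=1 is one option for making a failed request show up as a missing value in the yearly table instead of as a zero.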

#### 12. Checking for Outliers

In [24]:


def detect_outliers_iqr(data):
    outlier_dict = {}
    
    for column in data.select_dtypes(include=['float64', 'int64']).columns:

        Q1 = data[column].quantile(0.25)
        Q3 = data[column].quantile(0.75)
        IQR = Q3 - Q1

        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
        
        if not outliers.empty:
            outlier_dict[column] = outliers
        else:
            outlier_dict[column] = 0 

    return outlier_dict

outliers_dict = detect_outliers_iqr(yearly_data)

for column, outliers in outliers_dict.items():
    if isinstance(outliers, pd.DataFrame):
        print(f"Outliers in column '{column}':")
        print(outliers)
    else:
        print(f"Outliers in column '{column}': 0")
    print("\n")


Outliers in column 'social media': 0


Outliers in column 'mental health': 0


Outliers in column 'Instagram': 0


Outliers in column 'Twitter': 0


Outliers in column 'Facebook': 0


Outliers in column 'Snapchat': 0


Outliers in column 'TikTok': 0


Outliers in column 'LinkedIn': 0


Outliers in column 'YouTube': 0


Outliers in column 'WhatsApp': 0




The results show that the IQR rule flags no outliers in any numeric column, including 'social media', 'mental health', and the platform columns (Instagram, Twitter, Facebook, etc.). Note that the check runs on the aggregated table, which has at most 15 rows (five years for each of three countries), and that it pools all countries and years within a column. It therefore indicates that no yearly total is extreme relative to the others, rather than validating each underlying data point. The retry logic and random delays help requests complete, but they do not influence whether values are outliers.
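For reference, the IQR rule can be seen flagging a value on a small example. The sketch below applies the same bounds (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR) to a toy series; the values are illustrative only and `iqr_outlier_mask` is a hypothetical helper name.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy values for illustration only.
values = pd.Series([10, 12, 11, 13, 12, 100])
mask = iqr_outlier_mask(values)

print(values[mask].tolist())  # [100]
```

Here Q1 is 11.25 and Q3 is 12.75, so the fences are 9.0 and 15.0, and only the value 100 falls outside them.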

## - Operations and Decisions:
#### - Collection Methods:
We employed the pytrends library to systematically retrieve data from Google Trends. This method enabled us to capture public interest trends related to social media and mental health across multiple countries. Specifically, we used keyword searches within predefined timeframes for Saudi Arabia, India, and the UAE, which provided us with region-specific data.

#### - Processing and Cleaning Tasks:
During the data collection process, we implemented several cleaning and verification steps to ensure the integrity and completeness of our dataset.<br>
1- Data Availability Check:<br>After each data retrieval attempt, we confirmed that the returned dataset was not empty. If no data was available for a specific year or keyword group, we logged a message for transparency in our collection process.

2- Adding Relevant Columns:<br> We added two new columns—"Year" and "Country"—to each retrieved dataset. This organization enabled easier trend analysis based on these dimensions.

3- Data Concatenation:<br> We combined the retrieved data for each country into a single DataFrame using the pd.concat() function. This merged results from multiple requests while maintaining dataset integrity.

4- Resetting Index:<br> After concatenation, we reset the index of the final DataFrame with reset_index(inplace=True, drop=True) to ensure a clean, continuous index for easier manipulation.

5- Grouping Data:<br>Once we had the complete dataset, we grouped it by "Year" and "Country" to aggregate the keyword interest values using the groupby() and sum() functions, providing a summarized view of interest across keywords for each year and country.

6- Column Selection:<br> After aggregation, we retained specific columns, including "date," "Year," "Country," and all keyword columns, streamlining the dataset for easier analysis and visualization.

7- Handling Missing Data:<br>We monitored the dataset for any missing or NaN values. Identifying these gaps was crucial for ensuring the accuracy of our findings, although the specifics of addressing them will depend on the final analysis.

By following these steps, we ensured that our dataset was comprehensive, organized, and ready for further analysis of social media and mental health trends across the selected countries.

#### - Decisions Made:

- Keyword Grouping:<br> Because Google Trends accepts at most five search terms per request, we divided the ten keywords into two groups of five. This was necessary for the requests to be accepted.
- Yearly Timeframes:<br> Since pytrends takes custom timeframes as explicit date ranges, we manually defined one range per year (e.g., '2020-01-01 2020-12-31') to capture trends over entire years. This allowed us to compare year-on-year trends for each keyword in each country.
- Separate Data Processing for Each Country:<br> To maintain accuracy and manage API limitations, we processed the data for each country individually. This approach helped ensure that each country’s trends were collected correctly and allowed us to focus on the unique patterns in each region.
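One consequence of the keyword grouping decision is that scores are relative to the terms in the same request. The sketch below is a deliberately simplified model, assuming only that values in a request are scaled so the request's peak equals 100; Google's actual computation is more involved. The term names, volumes, and the function `scale_to_request_peak` are all hypothetical.

```python
def scale_to_request_peak(raw_volumes):
    """Scale every series in one request so that the request's peak becomes 100."""
    peak = max(max(series) for series in raw_volumes.values())
    return {
        term: [round(100 * value / peak) for value in series]
        for term, series in raw_volumes.items()
    }


# Hypothetical raw volumes: "small_term" has the same volume in both requests.
request_a = scale_to_request_peak({"big_term": [500, 1000], "small_term": [50, 100]})
request_b = scale_to_request_peak({"small_term": [50, 100]})

print(request_a["small_term"])  # scored next to a much larger term
print(request_b["small_term"])  # scored on its own
```

In this simplified model the same underlying volume is scored differently depending on which other terms share the request, which is why values from the two keyword groups should be compared with caution.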

## - Challenges
- Accessing Google Responses:<br>
A significant challenge we encountered was managing multiple requests to Google Trends without exceeding the rate limits. To avoid overloading the server and to keep data retrieval consistent, we paused for a random interval of 5 to 15 seconds between requests using time.sleep(random.uniform(5, 15)), and waited progressively longer (60 seconds multiplied by the attempt number) after a rate-limit (429) response. These delays were crucial for maintaining stable communication and avoiding request failures due to rate limits.

- Timeframe Specification:<br>
pytrends takes custom timeframes as explicit date ranges rather than as a year. To work with this, we manually defined exact date ranges for each year (January 1st to December 31st). This allowed us to gather data for entire years, ensuring that the data was structured according to our yearly analysis needs. The process required aggregating the data points returned within each year into yearly totals to align with our research objectives.

- Keyword Grouping Limitations:<br>
Google Trends accepts at most five search terms per request. This limitation led us to divide the ten keywords into two groups of five, with each group containing relevant social media and mental health terms. Although this increased the complexity of the data retrieval process, we combined the results during post-processing to ensure all keywords were included in the analysis. This required careful data management and merging, and it means that values from the two groups are scaled independently.

- Single Country Requests:<br>
The Google Trends API only supports data extraction for one country per request. To manage this, we developed a function that iterated through the countries (Saudi Arabia, India, and the UAE) and collected data separately for each. This significantly increased the time required for data collection but ensured that each country's trends were handled independently, maintaining the integrity and focus of the research.

## - A Bias and Fairness Report:

#### - Summary of Research on Data Bias and Fairness:
Our analysis centered on a specific set of keywords that, while relevant, may not encompass all aspects of social media's impact on mental health. This focus could limit the broader applicability of our findings. Additionally, the underrepresentation of certain countries in the data may constrain the analysis. This means that some regions or populations might be overlooked, potentially leading to skewed insights that do not fully reflect global trends or the diverse ways social media impacts mental health across different contexts.

#### - Evaluation of Dataset's Potential Biases:
We identified several potential biases in our dataset. The limited range of keywords might not fully represent all trends related to social media and mental health, which can skew our conclusions. Additionally, there is a risk of underrepresentation of certain countries, as differences in internet access, search behavior, and cultural factors can lead to uneven data availability across regions. These biases may further limit the accuracy and fairness of our analysis.

#### - Implications of Biases:
The biases in our data collection could significantly affect the credibility and reliability of our research conclusions. If certain trends or demographics are underrepresented, it could lead to misinterpretations of social media's impact on mental health.

#### - Recommendations for Mitigating Biases:
To enhance the robustness of our research, we recommend expanding the range of keywords used to encompass a wider array of social media platforms and mental health terms. Additionally, collecting data in multiple languages and increasing the scope of the study to include more diverse populations will help provide a more comprehensive understanding of the subject matter.