In [4]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

##  **Elevate Your Analysis: Mastering Data Gathering Techniques**

When data is not readily available as CSV, text, or Excel files, open APIs and web scraping offer alternative ways to obtain it. Data analysis often depends on exploring such alternative sources.

This notebook explores methods and tools for collecting data from various sources and preparing it for analysis or for open-source sharing. It covers two techniques: web scraping and API integration.

## **Table of Contents**
1. Introduction
2. Web Scraping
3. API Integration
4. Conclusion

## **1. Introduction**
Welcome to the "Data Gathering Techniques for Analysis" notebook. In this notebook, we will explore the critical process of gathering data for analysis and its significance in deriving meaningful insights. We will highlight a range of techniques that enable us to acquire data from diverse sources, ensuring a comprehensive understanding of data collection methods.


## **2. Web Scraping**
This section focuses on web scraping as a technique to extract data from websites. We will use the BeautifulSoup library together with requests, pandas, and NumPy. The code below demonstrates how to scrape data from HTML web pages and store it in a structured format for analysis.

We will gather user review data from the Consumer Affairs [website](http://www.consumeraffairs.com), specifically reviews of Pizza Hut. The same techniques can be applied (with some tweaking) to scrape reviews for other businesses or products, which makes this a versatile approach to data gathering.

Consumer Affairs serves as an invaluable platform for consumers to share their feedback and opinions, making it an excellent source of information for analysis purposes.

This data can serve as a valuable resource for sentiment analysis, trend identification, and customer satisfaction assessments.

Let's proceed with gathering the Pizza Hut review data.


In [None]:
# Import the necessary libraries
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as bs

# Set the number of pages to scrape
pages = 6

# Create an empty DataFrame to store the final results
final_df = pd.DataFrame()

# Iterate through each page
for page in range(1, pages+1):

    # Set the request header
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}

    # Construct the URL for the current page
    url = f'https://www.consumeraffairs.com/food/pizza-hut.html?page={page}'

    # Send a GET request to the URL
    response = requests.get(url, headers=header)
    
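    # Create a BeautifulSoup object to parse the HTML content
    # (naming the parser explicitly avoids a warning and keeps results consistent across machines)
    soup = bs(response.text, 'html.parser')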
    
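    # Optional while developing: print(soup.prettify()) shows the indented HTML.
    # prettify() only returns a string and does not modify soup, so it is not called here.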
    
    # Find all the relevant div elements for name, location, review date, review, and star rating
    first = soup.find_all('div', attrs={'class': "rvw-bd"})
    second = soup.find_all('div', attrs={'class': "rvw-aut"})
    third = soup.find_all('div', attrs={'class': "rvw__hdr"})

    # Initialize empty lists to store the extracted data
    name = []
    location = []
    review_date = []
    review = []
    star_rating = []
    
    # Extract the review date for each review
    for i in first:
        try:
            review_date.append(i.find('span').text.split(': ')[1])
        except (AttributeError, IndexError):
            review_date.append(np.nan)
    
    # Extract the review text for each review
    for i in first:
        try:
            review.append(i.find_all('p')[1].text)
        except IndexError:
            review.append('no review')
    
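    # Extract the name and location for each review
    for i in second:
        try:
            # Unpack both values before appending, so a failed split cannot
            # leave the name and location lists with different lengths
            reviewer, place = i.find('span').text.split(' of ', 1)
        except (AttributeError, ValueError):
            reviewer, place = 'unknown', 'unknown'
        name.append(reviewer)
        location.append(place)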
    
    # Extract the star rating for each review
    for i in third:
        try:
            # Find the meta tag with itemprop="ratingValue" and extract the content attribute value
            rating = i.find('meta', attrs={'itemprop': 'ratingValue'}).get('content')
            if rating:
                star_rating.append(rating)
            else:
                star_rating.append(np.nan)
        except AttributeError:
            star_rating.append(np.nan)
    
    # Create a temporary DataFrame with the extracted data for the current page
    temp_df = pd.DataFrame({'name': name, 'location': location, 'review_date': review_date, 'review': review, 'star_rating': star_rating})
    
    # Concatenate the temporary DataFrame with the final DataFrame
    final_df = pd.concat([final_df, temp_df], ignore_index=True)


**Explanation of the main points of the code:**

1. The code uses web scraping techniques to gather user reviews data from the Consumer Affairs website, focusing on Pizza Hut reviews.

2. The requests library is used to send HTTP requests, and the BeautifulSoup library is used to parse the HTML content of the website.
3. The code specifies the number of pages to scrape and creates an empty DataFrame (final_df) to store the results.
4. It loops through each page, sends a GET request to the website, and parses the HTML content using BeautifulSoup.
5. The code extracts the relevant information from the HTML, including the name, location, review date, review text, and star rating of each review.
6. Exception handling covers reviews where a field is missing or cannot be parsed; a placeholder value ('unknown', 'no review', or NaN) is stored instead.
7. The extracted data is stored in temporary lists, and a temporary DataFrame (temp_df) is created for each page.
8. The temporary DataFrame is concatenated with the final DataFrame (final_df) to combine the data from all pages.
9. After the loop finishes, the final DataFrame contains the scraped data from all pages.
10. The scraped data can be further analyzed, processed, or saved for future use.
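
The loop above parses whatever the server returns, including error pages. The following is a minimal sketch of a more defensive download step, not the notebook's original method. The helper names `fetch_with_requests` and `collect_pages`, the 10-second timeout, the shortened User-Agent, and the one-second pause are all assumptions made for illustration.

```python
import time

# Same URL pattern as the scraping cell above
BASE_URL = 'https://www.consumeraffairs.com/food/pizza-hut.html?page={page}'


def fetch_with_requests(url, timeout=10):
    """Download one page and raise on HTTP errors (4xx/5xx)."""
    import requests  # imported lazily so collect_pages can be tried offline

    header = {'User-Agent': 'Mozilla/5.0'}  # shortened, illustrative User-Agent
    response = requests.get(url, headers=header, timeout=timeout)
    response.raise_for_status()
    return response.text


def collect_pages(pages, fetch=fetch_with_requests, delay=1.0):
    """Return {page_number: html}; pages that fail to download are skipped."""
    html_by_page = {}
    for page in range(1, pages + 1):
        try:
            html_by_page[page] = fetch(BASE_URL.format(page=page))
        except Exception as exc:  # hypothetical policy: report and continue
            print(f'page {page} skipped: {exc}')
        time.sleep(delay)  # pause between requests to avoid hammering the site
    return html_by_page
```

With this in place, `collect_pages(6)` would return the HTML of each page that downloaded successfully, and each string could then be handed to `bs(html, 'html.parser')` for the extraction steps described above. Passing the download function as an argument also makes it possible to try the loop with a stub instead of the live site.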

In [204]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 713 entries, 0 to 712
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         713 non-null    object
 1   location     713 non-null    object
 2   review_date  713 non-null    object
 3   review       713 non-null    object
 4   star_rating  681 non-null    object
dtypes: object(5)
memory usage: 28.0+ KB
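
The `info()` output shows that `star_rating` is stored as `object` (strings) and has 681 non-null values out of 713. Before computing statistics, the column would typically be converted to a numeric type. The sketch below uses a small hypothetical frame, `sample_df`, with invented values in place of the scraped data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scraped final_df (values invented for illustration)
sample_df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'star_rating': ['5', '1', np.nan, '3'],
})

# errors='coerce' turns anything that is not a number into NaN
sample_df['star_rating'] = pd.to_numeric(sample_df['star_rating'], errors='coerce')

rated = sample_df['star_rating'].notna().sum()  # reviews that carry a rating
average = sample_df['star_rating'].mean()       # mean() skips NaN values
print(rated, average)
```

The same `pd.to_numeric` call should apply to `final_df['star_rating']` unchanged.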


## **3. API Integration**

In this section, we integrate an API to gather data for analysis. Our focus is the TMDB Top Rated API, which returns information about the highest-rated movies, one page of results per request.

APIs like this one can be found through directories such as [RapidAPI](https://rapidapi.com/collection/list-of-free-apis), which lists a range of free APIs. The code below calls the TMDB endpoint directly: it builds the request URL, sends the request, and converts the JSON response into a DataFrame.

Authentication is essential when working with APIs. TMDB expects each request to carry a Bearer token in the `Authorization` header, and the code below shows where that token goes.

By leveraging APIs, we can tap into extensive sources of information and harness it for our data analysis endeavors. The provided code snippets serve as practical illustrations, guiding the process of fetching data from the TMDB Top Rated API for analysis purposes.
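
A token hard-coded in a shared notebook is visible to anyone who can read it. One possible way to avoid that, sketched here, is to read the token from an environment variable. The variable name `TMDB_TOKEN` and the helper `build_headers` are assumptions for illustration and are not part of TMDB's API.

```python
import os


def build_headers(env_var='TMDB_TOKEN'):
    """Build TMDB request headers from a token stored in an environment variable."""
    token = os.environ.get(env_var)  # env_var is a hypothetical name
    if not token:
        raise RuntimeError(f'Set the {env_var} environment variable first')
    return {
        'accept': 'application/json',
        'Authorization': f'Bearer {token}',
    }
```

The returned dictionary can be passed as `headers=` to `requests.get`, in place of a dictionary that contains the token as a literal string.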

In [None]:
# importing libraries
import requests
import pandas as pd
import numpy as np

# Set the number of pages to fetch
pages = 500

# Create an empty DataFrame
final_df = pd.DataFrame()

# Loop through each page
for page in range(1, pages+1):
    # Construct the URL for the API request
    url = "https://api.themoviedb.org/3/movie/top_rated?language=en-US&page={}".format(page)

    # Set the headers for the API request
    headers = {
        "accept": "application/json",
        "Authorization": "Bearer YOUR_API_KEY"
    }

    # Send the API request and stop on an HTTP error (e.g. an invalid token)
    response = requests.get(url, headers=headers)
    response.raise_for_status()

    # Convert the response to a DataFrame
    temp_df = pd.DataFrame(response.json()['results'])
    temp_df = temp_df[['id', 'title', 'release_date', 'overview', 'popularity', 'vote_average', 'vote_count']]

    # Append the data to the final DataFrame
    final_df = pd.concat([final_df, temp_df], ignore_index=True)


**Explanation of the main points of the code:**

1. The code above fetches data from the TMDB API for the top-rated movies. It retrieves data from multiple pages by iterating over the specified number of pages (500 in this case).

2. To use this code, replace 'YOUR_API_KEY' in the Authorization header with your own TMDB credential. The Bearer scheme expects the API Read Access Token from your TMDB account settings, not the shorter v3 API key.

3. The API response is converted to a DataFrame, and only the desired columns (id, title, release_date, overview, popularity, vote_average, vote_count) are selected.

4. The fetched data is then appended to the final_df DataFrame using pd.concat(); ignore_index=True gives the combined DataFrame a single continuous index instead of repeating each page's 0-based index.
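
The steps above can also be written so that rows are collected in a plain list and the DataFrame is built once at the end, which avoids copying the growing frame on every page. This is a sketch under stated assumptions rather than the notebook's original code: the function names `get_tmdb_page` and `collect_top_rated` are hypothetical, the 10-second timeout is arbitrary, and the loop relies on the `total_pages` field that TMDB includes in paginated responses.

```python
API_URL = 'https://api.themoviedb.org/3/movie/top_rated'
COLUMNS = ['id', 'title', 'release_date', 'overview',
           'popularity', 'vote_average', 'vote_count']


def get_tmdb_page(page, token='YOUR_API_KEY'):
    """Request one page of results; the placeholder token must be replaced."""
    import requests  # imported lazily so collect_top_rated can be tried offline

    response = requests.get(
        API_URL,
        params={'language': 'en-US', 'page': page},
        headers={'accept': 'application/json',
                 'Authorization': f'Bearer {token}'},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()


def collect_top_rated(max_pages, get_page=get_tmdb_page):
    """Return a list of dicts holding only the columns of interest."""
    rows = []
    for page in range(1, max_pages + 1):
        payload = get_page(page)
        for movie in payload.get('results', []):
            rows.append({col: movie.get(col) for col in COLUMNS})
        # Stop early if the API reports fewer pages than requested
        if page >= payload.get('total_pages', max_pages):
            break
    return rows
```

A call such as `pd.DataFrame(collect_top_rated(500), columns=COLUMNS)` would then produce a frame with the same columns as `final_df`.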


In [25]:
final_df

| | id | title | release_date | overview | popularity | vote_average | vote_count |
|---|---|---|---|---|---|---|---|
| 0 | 238 | The Godfather | 1972-03-14 | Spanning the years 1945 to 1955, a chronicle o... | 97.119 | 8.7 | 18014 |
| 1 | 278 | The Shawshank Redemption | 1994-09-23 | Framed in the 1940s for the double murder of h... | 82.607 | 8.7 | 23861 |
| 2 | 240 | The Godfather Part II | 1974-12-20 | In the continuing saga of the Corleone crime f... | 63.884 | 8.6 | 10879 |
| 3 | 19404 | Dilwale Dulhania Le Jayenge | 1995-10-19 | Raj is a rich, carefree, happy-go-lucky second... | 22.713 | 8.6 | 4137 |
| 4 | 424 | Schindler's List | 1993-12-15 | The true story of how businessman Oskar Schind... | 46.255 | 8.6 | 14114 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 12521 | Shocker | 1989-10-27 | After being sent to the electric chair, a seri... | 10.974 | 5.6 | 353 |
| 9996 | 298722 | Soap Opera | 2014-10-23 | When the kooky tenants of an apartment block e... | 3.860 | 5.6 | 227 |
| 9997 | 467673 | Budapest | 2018-06-27 | Two best friends stuck in boring jobs become b... | 5.854 | 5.6 | 389 |
| 9998 | 101179 | Truth or Dare | 2012-08-05 | A group of friends are lured to an isolated ca... | 11.796 | 5.6 | 312 |


## **4. Conclusion**

In the final section, we summarize the key takeaways from this notebook. We emphasize the importance of data gathering techniques in the data analysis process and discuss the advantages and considerations associated with each method. We also suggest potential future directions for expanding on the techniques covered.