<a href="https://colab.research.google.com/github/vodnalashiva131/INFO-5731/blob/main/Vodnala_shiva_exercise_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind, what kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

Research Question: What are the factors that influence the success of a YouTube video?

Data to Collect:
*   Video title
*   Video description
*   Video category
*   Video length
*   Number of views
*   Number of likes
*   Number of dislikes
*   Number of comments
*   Video upload date
*   Channel name
*   Channel subscriber count

Amount of Data Needed:
*   We need at least 1000 videos for analysis.

Steps for Collecting and Saving the Data:
1. Web Scraping:
    *   To extract data from YouTube, we should use a web scraping tool such as Beautiful Soup or Selenium.
    *   We should write a script that opens the YouTube search page and submits the relevant query.
    *   The script should extract the fields listed above from the search results.
2. Data Cleaning:
    *   Remove any duplicate data.
    *   Remove any missing data.
    *   Convert the extracted data into a consistent format wherever possible.
3. Data Saving:
    *   Save the cleaned data to a CSV file.
    *   Name the file according to the requirements.
    *   Store the file in a safe location to prevent data loss.
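
The cleaning and saving steps above can be sketched as follows. This is a minimal illustration using only the standard library, assuming every scraped value arrives as a string; the field list, the `clean_records` and `save_csv` helpers, the sample rows, and the output file name are hypothetical and not part of the collected data.

```python
import csv

# Hypothetical subset of the fields listed under "Data to Collect"
FIELDS = ["title", "url", "views", "upload_date", "channel_name"]


def clean_records(records):
    """Drop rows with missing fields and duplicate URLs; convert views to int."""
    seen_urls = set()
    cleaned = []
    for row in records:
        # Remove rows where any expected field is missing or empty
        if any(not row.get(field) for field in FIELDS):
            continue
        # Remove duplicates, using the video URL as the unique key
        if row["url"] in seen_urls:
            continue
        seen_urls.add(row["url"])
        normalized = {field: row[field].strip() for field in FIELDS}
        # Consistent format: "1,234" becomes the integer 1234
        normalized["views"] = int(normalized["views"].replace(",", ""))
        cleaned.append(normalized)
    return cleaned


def save_csv(rows, path):
    """Write the cleaned rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Made-up sample rows, for illustration only
sample = [
    {"title": "Intro to Python", "url": "/watch?v=a1", "views": "1,234",
     "upload_date": "2024-01-05", "channel_name": "Channel A"},
    {"title": "Intro to Python", "url": "/watch?v=a1", "views": "1,234",
     "upload_date": "2024-01-05", "channel_name": "Channel A"},  # duplicate
    {"title": "", "url": "/watch?v=b2", "views": "50",
     "upload_date": "2024-02-01", "channel_name": "Channel B"},  # missing title
    {"title": "Pandas basics ", "url": "/watch?v=c3", "views": "7",
     "upload_date": "2024-03-09", "channel_name": "Channel C"},
]
cleaned = clean_records(sample)
save_csv(cleaned, "youtube_data_clean.csv")
```

Using the URL as the deduplication key is a design choice: titles can repeat across channels, while a video URL identifies one video.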

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [None]:
# Import libraries for driving Chrome and parsing HTML
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
import pandas as pd
from bs4 import BeautifulSoup

# Configure headless Chrome options
chrome = Options()
chrome.add_argument('--headless')
chrome.add_argument('--no-sandbox')
chrome.add_argument('--disable-dev-shm-usage')

# Initialize the Chrome driver
chrome_driver = webdriver.Chrome(options=chrome)

# BASE URL
base_url = "https://www.youtube.com/results?search_query="

# Declaring the Search Term
search_term = "youtube"

# Combine the base URL and the search term to build the search URL
search_url = base_url + search_term

data = []

for i in range(1000):
    # Opening the search URL
    chrome_driver.get(search_url)
    time.sleep(5)
    page_source = chrome_driver.page_source
    soup = BeautifulSoup(page_source, 'html.parser')

    # Extract the video details from the search results page
    video_titles = [a.text for a in soup.select("h3 > a")]
    video_urls = [a['href'] for a in soup.select("h3 > a")]
    views = [span.text for span in soup.select("span.view-count")]
    upload_dates = [span.text for span in soup.select("span.date")]
    channel_names = [a.text for a in soup.select("a.yt-user-name")]

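    # Append one row per video; zip() stops at the shortest list, keeping fields aligned
    for title, url, view, date, channel in zip(
        video_titles, video_urls, views, upload_dates, channel_names
    ):
        data.append({
            "title": title,
            "url": url,
            "views": view,
            "upload_date": date,
            "channel_name": channel
        })

    # Stop once at least 1000 entries have been collected
    if len(data) >= 1000:
        break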

    # Look for a link to the next page of results
    next_page_url = soup.select_one("a[aria-label='Next page']")
    if next_page_url:
        search_url = "https://www.youtube.com" + next_page_url['href']
    else:
        break

    # Pause between requests to avoid being blocked by the server
    time.sleep(5)
chrome_driver.quit()
# Convert the collected data into a pandas DataFrame
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv("youtube_data.csv", index=False)
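
The search URL in the cell above is built by plain string concatenation, which works for a single-word term such as "youtube" but not for terms containing spaces or special characters. A minimal sketch of encoding the query with the standard library follows; the `build_search_url` helper and the example term are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.youtube.com/results"


def build_search_url(search_term):
    """Return the search URL with the query string percent-encoded."""
    return BASE_URL + "?" + urlencode({"search_query": search_term})


print(build_search_url("python web scraping"))
# prints: https://www.youtube.com/results?search_query=python+web+scraping
```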


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), or CiteSeerX (https://citeseerx.ist.psu.edu/index), or Semantic Scholar (https://www.semanticscholar.org/), or ACM Digital Libraries (https://dl.acm.org/) with the keyword "XYZ". The articles should be published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract
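
The year field and the 2014-2024 restriction can also be checked after collection. The sketch below is a minimal illustration, assuming the date text scraped for each article contains a four-digit year; the `extract_year` and `within_range` helpers and the sample records are hypothetical.

```python
import re

# Matches a standalone four-digit year starting with 19 or 20
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_year(text):
    """Return the first four-digit year found in text, or None."""
    match = YEAR_PATTERN.search(text or "")
    return int(match.group()) if match else None


def within_range(record, start=2014, end=2024):
    """Return True when the record's year falls inside [start, end]."""
    year = extract_year(record.get("Year", ""))
    return year is not None and start <= year <= end


# Made-up records, for illustration only
records = [
    {"Title": "Paper A", "Year": "April 2020"},
    {"Title": "Paper B", "Year": "December 2012"},
    {"Title": "Paper C", "Year": ""},
]
kept = [r["Title"] for r in records if within_range(r)]
print(kept)  # prints: ['Paper A']
```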

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm
import time


class ACMScraper(object):
    def __init__(self):
        pass

    def payload(self, keyword, st_page=0, pasize=50, start_year=2018, end_year=2022):
        params = (
            ("AllField", keyword),
            ("AfterYear", str(start_year)),
            ("BeforeYear", str(end_year)),
            ("queryID", "45/3852851837"),
            ("sortBy", "relevancy"),
            ("startPage", str(st_page)),
            ("pageSize", str(pasize)),
        )

        response = requests.get(
            "https://dl.acm.org/action/doSearch",
            params=params,
            headers={"accept": "application/json"},
        )
        soup = BeautifulSoup(response.text, "html.parser")

        return soup

    def soup_html(self, soup):
        all_papers = []
        main_class = soup.find(
            "div", {"class": "col-lg-9 col-md-9 col-sm-8 sticko__side-content"}
        )
        main_c = main_class.find_all("div", {"class": "issue-item__content"})

        for paper in main_c:
            temp_data = {}

            try:
                content_ = paper.find("h5", {"class": "issue-item__title"})
                paper_url = content_.find("a", href=True)["href"].split("/")

                title = content_.text
                temp_data["Title"] = title

                doi_url = ["https://dl.acm.org", "doi", "pdf"]
                doi_url.extend(paper_url[2:])
                temp_data["Link"] = "/".join(doi_url)

                # Extract additional details
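                details = paper.find("div", {"class": "issue-item__detail"})
                if details:
                    venue = details.find("span", {"class": "issue-item__detail__text"})
                    if venue:
                        temp_data["Venue"] = venue.text.strip()

                    # Year: first four-digit token starting with 19 or 20 in the detail
                    # text (assumes the publication date is listed there)
                    tokens = details.get_text(" ").replace(",", " ").split()
                    year = next(
                        (t for t in tokens
                         if len(t) == 4 and t.isdigit() and t[:2] in ("19", "20")),
                        None,
                    )
                    if year:
                        temp_data["Year"] = year

                # Authors: join every matching name rather than keeping only the first
                authors = paper.find_all("span", {"class": "loa__author-name"})
                if authors:
                    temp_data["Authors"] = ", ".join(a.text.strip() for a in authors)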

                abstract = paper.find("div", {"class": "issue-item__abstract"})
                if abstract:
                    temp_data["Abstract"] = abstract.text.strip()

                all_papers.append(temp_data)
            except Exception as e:
                print(f"Error processing paper: {e}")

        df = pd.DataFrame(all_papers)
        return df

    def acm(
        self,
        keyword,
        max_pages=5,
        min_year=2015,
        max_year=2022,
        full_page_result=False,
        api_wait=5,
    ):
        all_pages = []

        for page in tqdm(range(max_pages)):
            acm_soup = self.payload(
                keyword, st_page=page, pasize=50, start_year=min_year, end_year=max_year
            )

            acm_result = self.soup_html(acm_soup)
            all_pages.append(acm_result)
            time.sleep(api_wait)

        df = pd.concat(all_pages)
        return df


if __name__ == "__main__":
    keyword = "XYZ"
    acm_scraper = ACMScraper()
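    # 5 pages x 50 results yields at most 250 articles from the default 2015-2022 range;
    # meeting the 1000-article, 2014-2024 requirement needs max_pages=20,
    # min_year=2014, max_year=2024.
    articles_df = acm_scraper.acm(keyword, max_pages=5)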
    print(articles_df)

 80%|████████  | 4/5 [00:44<00:10, 10.68s/it]

Error processing paper: 'NoneType' object has no attribute 'find'


100%|██████████| 5/5 [00:56<00:00, 11.35s/it]

                                                Title  \
0   The xyz algorithm for fast interaction search ...   
1   Importance of Internalization of Tacit Knowled...   
2   Data Governance to Improve Data Quality for St...   
3   XYZ-Randomization using TSVs for Low-Latency E...   
4   Implementation of The Fuzzy Inference System t...   
..                                                ...   
44  MadMax: surviving out-of-gas conditions in Eth...   
45  A Machine Learning Approach for Detection Plan...   
46  PLL to the rescue: a novel EM fault countermea...   
47  Why GPUs are Slow at Executing NFAs and How to...   
48  Analyzing the Keystroke Dynamics of Web Identi...   

                                                 Link  \
0   https://dl.acm.org/doi/pdf/10.5555/3291125.329...   
1   https://dl.acm.org/doi/pdf/10.1145/3429789.342...   
2   https://dl.acm.org/doi/pdf/10.1145/3451471.345...   
3   https://dl.acm.org/doi/pdf/10.1145/3130218.313...   
4   https://dl.acm.org/doi/pdf




## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.


In [None]:
%pip install -q instaloader



In [None]:
import instaloader
from instaloader.exceptions import QueryReturnedNotFoundException
from datetime import datetime
import csv

class GetInstagramProfile():
    def __init__(self):
        self.L = instaloader.Instaloader()

    def download_users_profile_picture(self, username):
        self.L.download_profile(username, profile_pic_only=True)

    def download_hashtag_posts(self, hashtag):
        try:
            for post in instaloader.Hashtag.from_name(self.L.context, hashtag).get_posts():
                self.L.download_post(post, target='#' + hashtag)
        except QueryReturnedNotFoundException as e:
            print(f"Error: {e}")
            print(f"The following skipped Hashtag is: {hashtag}")

if __name__ == "__main__":
    cls = GetInstagramProfile()
    hashtags = ["gadgets", "another_hashtag"]

    for hashtag in hashtags:
        cls.download_hashtag_posts(hashtag)

JSON Query to explore/tags/gadgets/: 404 Not Found [retrying; skip with ^C]
JSON Query to explore/tags/gadgets/: 404 Not Found [retrying; skip with ^C]


Error: JSON Query to explore/tags/gadgets/: 404 Not Found
The following skipped Hashtag is: gadgets


JSON Query to explore/tags/another_hashtag/: 404 Not Found [retrying; skip with ^C]
JSON Query to explore/tags/another_hashtag/: 404 Not Found [retrying; skip with ^C]


Error: JSON Query to explore/tags/another_hashtag/: 404 Not Found
The following skipped Hashtag is: another_hashtag
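
The run above returned 404 for both hashtags, so no posts were collected. For the requirement of more than four columns, a minimal sketch of the tabular output follows. It assumes post metadata has already been obtained by some means; the `PostRecord` fields, the `write_posts` helper, the sample values, and the file name are all hypothetical and do not come from Instagram.

```python
import csv
from dataclasses import asdict, dataclass, fields


@dataclass
class PostRecord:
    """One row of post metadata (six columns, hypothetical field names)."""
    shortcode: str
    username: str
    posted_at: str
    caption: str
    likes: int
    comments: int


def write_posts(records, path):
    """Write the records to a CSV file and return the column names."""
    columns = [f.name for f in fields(PostRecord)]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        for record in records:
            writer.writerow(asdict(record))
    return columns


# Made-up records, for illustration only
posts = [
    PostRecord("abc123", "user_one", "2024-02-01T10:00:00", "New #gadgets", 12, 3),
    PostRecord("def456", "user_two", "2024-02-02T18:30:00", "Desk setup #gadgets", 40, 5),
]
columns = write_posts(posts, "instagram_posts.csv")
print(len(columns))  # prints: 6
```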


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

# Mandatory Question

Learning Experience:
The web scraping exercises provided practical experience in collecting data from online sources. The most important lessons were understanding HTML structure, sending HTTP requests, and parsing HTML with BeautifulSoup. Learning to navigate and adapt to different website structures was also necessary.

Challenges Encountered While Performing This Exercise:
Content loaded dynamically via JavaScript was difficult to scrape and required either a headless browser or an alternative data source. Scraping ethically also required careful attention to rate limits, avoiding IP blocking, and observing each website's terms of service.
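
One way to respect a site's crawling rules is to check its robots.txt before requesting a page. The sketch below is a minimal illustration using the standard library's `urllib.robotparser`; the rules, user agent name, and URLs are made up for the example, and a real scraper would load the site's actual robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

parser = RobotFileParser()
parser.parse(rules)

AGENT = "course-exercise-bot"  # hypothetical user agent name


def allowed_urls(urls):
    """Return only the URLs that the parsed rules permit for AGENT."""
    return [url for url in urls if parser.can_fetch(AGENT, url)]


candidates = [
    "https://example.com/articles/1",
    "https://example.com/private/report",
]
permitted = allowed_urls(candidates)
delay = parser.crawl_delay(AGENT)  # seconds to wait between requests, or None

print(permitted)  # prints: ['https://example.com/articles/1']
print(delay)      # prints: 5
```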

Relevance to My Field of Study:
Collecting and analyzing data from online sources is useful in many fields. In data science, social sciences, and business analytics, web scraping supplies data for research and decision-making. The skills learned here make data collection more efficient, which supports better-informed, data-driven decisions.