# **INFO5731 In-class Exercise 2**

The purpose of this exercise is to understand users' information needs, and then collect data from different sources for analysis by implementing web scraping using Python.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting research question (or practical question or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? Specify the amount of data needed for analysis. Provide detailed steps for collecting and saving the data.

**Research question**: How does the popularity of specific songs on the "mymp3song.site" platform, as indicated by user downloads, correlate with social media trends and discussions?

**Data to collect**: Data should be collected from social media platforms where users discuss and share information about music. APIs for platforms such as Twitter, Instagram, or Facebook could be used to gather user-generated content, including posts, comments, and interactions related to the songs available on "mymp3song.site". The collected data should include sentiment, user engagement metrics, and possibly demographic details of the users taking part in the discussions. A temporal analysis could also show how social media trends evolve over time and whether they coincide with fluctuations in the popularity of specific songs.

**Amount of data**: The amount needed depends on how frequently the songs are discussed and on the desired granularity of the analysis, but a substantial dataset covering several months would provide more meaningful insights.

**Collecting and saving the data**: The collected data should be stored in a structured format, with separate tables for posts, comments, engagement metrics, and song details. It is crucial to comply with privacy laws and to document the data collection methodology for transparency and reproducibility. By analyzing this integrated dataset, researchers could uncover patterns linking social media trends to user engagement and downloads on the music platform, shedding light on the influence of online discussions on consumer behavior in the music industry.
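
The storage layout described above (separate tables for song details, posts, comments, and engagement metrics) could be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: every table name, column name, and sample row below is hypothetical and not prescribed by the exercise.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS songs (
    song_id   INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    downloads INTEGER
);
CREATE TABLE IF NOT EXISTS posts (
    post_id    TEXT PRIMARY KEY,
    song_id    INTEGER NOT NULL REFERENCES songs(song_id),
    platform   TEXT NOT NULL,
    created_at TEXT NOT NULL,
    body       TEXT
);
CREATE TABLE IF NOT EXISTS comments (
    comment_id TEXT PRIMARY KEY,
    post_id    TEXT NOT NULL REFERENCES posts(post_id),
    created_at TEXT NOT NULL,
    body       TEXT
);
CREATE TABLE IF NOT EXISTS engagement (
    post_id TEXT PRIMARY KEY REFERENCES posts(post_id),
    likes   INTEGER DEFAULT 0,
    shares  INTEGER DEFAULT 0,
    replies INTEGER DEFAULT 0
);
"""


def create_database(path=":memory:"):
    """Open (or create) the database and make sure all four tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def posts_per_song(conn):
    """Return (title, number_of_posts) pairs, most-discussed song first."""
    query = """
        SELECT s.title, COUNT(p.post_id) AS n_posts
        FROM songs AS s
        LEFT JOIN posts AS p ON p.song_id = s.song_id
        GROUP BY s.song_id
        ORDER BY n_posts DESC, s.title
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = create_database()
    # Made-up sample rows, only to show how the tables relate to each other.
    conn.execute("INSERT INTO songs (song_id, title) VALUES (1, 'Channa Ve')")
    conn.execute(
        "INSERT INTO posts (post_id, song_id, platform, created_at, body) "
        "VALUES ('p1', 1, 'reddit', '2024-02-01T10:00:00', 'sample post text')"
    )
    print(posts_per_song(conn))
```

Keeping song details in their own table means each post stores only a `song_id`, so download counts can later be joined against discussion volume without repeating song metadata in every row.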

## Question 2 (10 Points)
Write Python code to collect a dataset of 1000 samples related to the question discussed in Question 1.

In [11]:
import os

import requests

BASE_URL = 'http://mymp3song.site'
INDEX_FILE = 'index.txt'
SONGS_DIR = 'Songs'
# Keep the index inside SONGS_DIR so that reads and writes use the same file.
INDEX_PATH = os.path.join(SONGS_DIR, INDEX_FILE)


def add_in_file(title):
    """Append a downloaded song title to the index file."""
    try:
        with open(INDEX_PATH, 'a', encoding='utf-8') as index:
            index.write(title + "\n")
        print("Title Added Successfully")
    except OSError as e:
        print("Error in index adding: ", e)


def search_in_file(title):
    """Return True if the exact title is already recorded in the index file."""
    try:
        with open(INDEX_PATH, 'r', encoding='utf-8') as index:
            return title in index.read().splitlines()
    except OSError as e:
        print("Error in index searching : ", e)
        return False


def song_download(url, title):
    """Stream the file at url into SONGS_DIR/<title>.mp3 and record it in the index."""
    os.makedirs(SONGS_DIR, exist_ok=True)

    try:
        with requests.get(url, stream=True, timeout=30) as req:
            req.raise_for_status()
            with open(os.path.join(SONGS_DIR, title + ".mp3"), "wb") as f:
                # Write in chunks so the whole file is never held in memory.
                for chunk in req.iter_content(chunk_size=8192):
                    f.write(chunk)
        print("Downloaded Successfully")
        add_in_file(title)
    except requests.exceptions.RequestException as e:
        print("Error in download:", e)

# Example: Download a specific song
song_url = 'http://mymp3song.site/download/41578/channa-ve'
song_title = 'Channa Ve'

song_download(song_url, song_title)

Downloaded Successfully
Title Added Successfully


## Question 3 (10 Points)
Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "XYZ". The articles should have been published in the last 10 years (2014-2024).

The following information from the article needs to be collected:

(1) Title of the article

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract
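
One way to hold the five required fields is a small record type plus a filter for the 2014-2024 window. The sketch below is an assumption about how the data might be organized, not part of the assignment: the names `ArticleRecord` and `published_between` are hypothetical, and the sample records are placeholders.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ArticleRecord:
    """One collected article; field names mirror the five required items."""
    title: str
    venue: str
    year: Optional[int]  # None when the source does not expose a year
    authors: List[str]
    abstract: str


def published_between(records, first_year=2014, last_year=2024):
    """Keep records whose year is known and lies in the inclusive range."""
    return [
        r for r in records
        if r.year is not None and first_year <= r.year <= last_year
    ]


if __name__ == "__main__":
    # Placeholder records, only to show the filter in action.
    sample = [
        ArticleRecord("Placeholder article A", "Placeholder venue", 2012, ["A. Author"], ""),
        ArticleRecord("Placeholder article B", "Placeholder venue", 2020, ["B. Author"], ""),
    ]
    for record in published_between(sample):
        print(asdict(record))
```

Storing the year as an integer (or `None`) keeps the range check explicit and avoids silently accepting records whose year could not be parsed.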

In [None]:
import json
import re
import time

import requests
from bs4 import BeautifulSoup

# Scholar may reject the default python-requests client; this header value is illustrative.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def scrape_articles(query, num_articles, year_from=2014, year_to=2024):
    base_url = "https://scholar.google.com/scholar"
    articles = []

    for start in range(0, num_articles, 10):
        params = {
            "start": start,
            "q": query,
            "hl": "en",
            "as_sdt": "0,5",
            "as_ylo": year_from,  # earliest publication year
            "as_yhi": year_to,    # latest publication year
        }

        try:
            response = requests.get(base_url, params=params, headers=HEADERS, timeout=30)
            response.raise_for_status()  # Check for HTTP errors
        except requests.RequestException as e:
            print(f"Error: {e}")
            continue  # Continue with the next results page

        soup = BeautifulSoup(response.text, "html.parser")
        for result in soup.find_all("div", class_="gs_ri"):
            title_element = result.find("h3", class_="gs_rt")
            meta_element = result.find("div", class_="gs_a")
            if title_element is None or meta_element is None:
                continue

            link_element = title_element.find("a")
            link = link_element["href"] if link_element and link_element.has_attr("href") else "N/A"

            # The metadata line usually reads "authors - venue, year - publisher".
            # The gs_* classes exist only on Scholar's own pages, so everything is
            # parsed from the results page instead of the linked article page.
            meta = meta_element.get_text(" ", strip=True).replace("\xa0", " ")
            parts = [part.strip() for part in meta.split(" - ")]
            venue_year = parts[1] if len(parts) > 1 else ""
            year_match = re.search(r"\b(19|20)\d{2}\b", venue_year)
            venue = re.sub(r",?\s*\b(19|20)\d{2}\b", "", venue_year).strip()

            # The results page shows only a snippet of the abstract.
            abstract_element = result.find("div", class_="gs_rs")

            articles.append({
                "Title": title_element.get_text(" ", strip=True),
                "Authors": parts[0],
                "Link": link,
                "Venue": venue or "N/A",
                "Year": year_match.group(0) if year_match else "N/A",
                "Abstract": abstract_element.get_text(" ", strip=True) if abstract_element else "N/A",
            })

        # Pause between result pages to avoid rate-limiting
        time.sleep(1)

    return articles


if __name__ == "__main__":
    keyword = "Web scraping"
    num_articles = 1000
    articles = scrape_articles(keyword, num_articles)

    # Print the first article as a sample
    if articles:
        print(json.dumps(articles[0], indent=2))
    else:
        print("No articles were collected.")
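
Once the article dictionaries are collected, they can be saved for later analysis. The following is a minimal sketch using the standard `csv` module. It assumes each article is a dictionary with the keys used in the code above (`Title`, `Authors`, `Link`, `Venue`, `Year`, `Abstract`); the helper names and the output file name are hypothetical.

```python
import csv
import io

# Column order for the output file; matches the dictionary keys used by the scraper.
FIELDNAMES = ["Title", "Authors", "Link", "Venue", "Year", "Abstract"]


def articles_to_csv_text(articles):
    """Serialize a list of article dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    for article in articles:
        # Missing keys are written as "N/A" so every row has the same columns.
        writer.writerow({name: article.get(name, "N/A") for name in FIELDNAMES})
    return buffer.getvalue()


def save_articles_csv(articles, path="articles.csv"):
    """Write the CSV text to disk; newline='' prevents doubled line endings."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write(articles_to_csv_text(articles))
```

Calling `save_articles_csv(articles)` after `scrape_articles(...)` would write one row per article. Separating serialization from file writing makes the CSV layout easy to check without touching the disk.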

## Question 4A (10 Points)
Develop Python code to collect data from social media platforms like Reddit, Instagram, X (formerly known as Twitter), Facebook, or any other. Use hashtags, keywords, usernames, or user IDs to gather the data.



Ensure that the collected data has more than four columns.
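
As one possible starting point, the sketch below queries Reddit's public JSON search listing by keyword and flattens each post into a row with seven columns. It rests on several assumptions: that the endpoint is reachable without authentication from your network, that its rate limits and terms of use permit the request, and that the chosen columns suit your question. The `User-Agent` string and helper names are hypothetical, and `urllib` is used only to keep the example dependency-free.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Reddit's public JSON search listing; subject to rate limits and terms of use.
SEARCH_URL = "https://www.reddit.com/search.json"
USER_AGENT = "info5731-exercise/0.1"  # hypothetical identifier for this script
COLUMNS = ["id", "title", "author", "subreddit", "score", "num_comments", "created_utc"]


def fetch_listing(keyword, limit=100):
    """Download one page of search results for the keyword as a parsed JSON dict."""
    query = urllib.parse.urlencode({"q": keyword, "limit": limit})
    request = urllib.request.Request(
        f"{SEARCH_URL}?{query}", headers={"User-Agent": USER_AGENT}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def listing_to_rows(listing):
    """Flatten a listing payload into one dictionary per post, limited to COLUMNS."""
    rows = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        # Fields absent from the payload become None instead of raising KeyError.
        rows.append({column: post.get(column) for column in COLUMNS})
    return rows
```

A call such as `rows = listing_to_rows(fetch_listing("Channa Ve"))` would return a list of dictionaries that can be loaded into a DataFrame or written to CSV. The seven columns satisfy the "more than four columns" requirement.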


## Question 4B (10 Points)
If you encounter challenges with Question-4 web scraping using Python, employ any online tools such as ParseHub or Octoparse for data extraction. Introduce the selected tool, outline the steps for web scraping, and showcase the final output in formats like CSV or Excel.



Upload a document (Word or PDF File) in any shared storage (preferably UNT OneDrive) and add the publicly accessible link in the below code cell.

Please choose only one option for Question 4. If you do both options, we will grade only the first one.

I cannot directly access or interact with external tools, online services, or shared storage. However, I can guide you on how to use a web scraping tool like ParseHub or Octoparse.

# Mandatory Question

**Important: Reflective Feedback on Web Scraping and Data Collection**



Please share your thoughts and feedback on the web scraping and data collection exercises you have completed in this assignment. Consider the following points in your response:



Learning Experience: Describe your overall learning experience in working on web scraping tasks. What were the key concepts or techniques you found most beneficial in understanding the process of extracting data from various online sources?



Challenges Encountered: Were there specific difficulties in collecting data from certain websites, and how did you overcome them? If you opted for the non-coding option, share your experience with the chosen tool.



Relevance to Your Field of Study: How might the ability to gather and analyze data from online sources enhance your work or research?

**(no grading of your submission if this question is left unanswered)**