## The Second In-Class Exercise (09/13/2023, 40 points in total)

Kindly use the provided .ipynb document to write your code or respond to the questions. Avoid generating a new file.
Execute all the cells before your final submission.

This in-class exercise is due tomorrow September 14, 2023 at 11:59 PM. No late submissions will be considered.

The purpose of this exercise is to understand users' information needs, then collect data from different sources for analysis.

Question 1 (10 points): Describe an interesting research question (or a practical question, or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? How much data is needed for the analysis? Describe the detailed steps for collecting and saving the data.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:

Research Question: What are the most common topics covered in research news articles on the University of North Texas website?

To answer this question, we would need to collect data on the news titles, leads, and story tags from the research news articles on the University of North Texas website.

Data Needed:

- News Title
- News Lead
- Story Tags (if available)

Number of Data Samples Needed for the Analysis:

Since this is an exploratory analysis, it is hard to specify an exact number of samples in advance, but a larger sample gives more reliable insights. For this exercise, we collect up to 1000 data samples, which should be sufficient for an initial analysis.

Steps for Collecting and Saving the Data:

1. Setting Up the Environment:
   - Import the necessary libraries (urllib.request, BeautifulSoup, and csv).
   - Initialize variables such as news_count, max_news_count, max_pages, and base_url.
2. Web Scraping:
   - Use a loop to iterate through the pages (in this case, up to 18 pages).
   - For each page:
     - Construct the URL.
     - Send an HTTP GET request to the URL.
     - Parse the HTML content using BeautifulSoup.
     - Find all elements with class "news-item".
3. Extracting Data:
   - Inside the loop over the news items, extract the news title, lead, and story tags (if available).
4. Storing Data:
   - Append the extracted data to a list (data).
5. Saving to CSV:
   - After collecting all data samples, save the data to a CSV file.
   - Open a CSV file in write mode, create a CSV writer, write the header row, and then write the data rows.
6. Print Confirmation:
   - Print a message to confirm that the data has been saved to the CSV file.


'''

Question 2 (10 points): Write Python code to collect the 1000 data samples you discussed above.

In [55]:
import urllib.request
from bs4 import BeautifulSoup
import csv

# Initialize variables to count news items and pages
news_count = 0
max_news_count = 1000  # You can adjust this as needed
max_pages = 18  # The number of pages you want to scrape

# Initialize the base URL
base_url = "https://research.unt.edu/news?page="

# Create a list to store the data
data = []

# Loop through each page
for page_num in range(1, max_pages + 1):
    # Construct the URL for the current page
    url = base_url + str(page_num)

    # Send an HTTP GET request to the URL
    response = urllib.request.urlopen(url)

    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response, 'html.parser')

    # Find all elements with class "news-item"
    news_items = soup.find_all(class_='news-item')

    # Loop through the found news items and extract title, lead, and story tags
    for item in news_items:
        if news_count >= max_news_count:
            break

        # Extract the news title (update class name as needed)
        news_title_element = item.find(class_='news-title')  # Replace 'news-title' with the correct class name
        news_title = news_title_element.text.strip() if news_title_element else "Title Not Available"

        # Extract the news lead
        news_lead_element = item.find(class_='news-lead')
        news_lead = news_lead_element.text.strip() if news_lead_element else "Lead Not Available"

        # Extract the story tags, if available
        story_tags_element = item.find_all(class_='story-tags')
        story_tags = ', '.join([tag.text.strip() for tag in story_tags_element]) if story_tags_element else "Story Tags Not Available"

        # Append the data to the list
        data.append([news_title, news_lead, story_tags])

        news_count += 1

    # If you reached the maximum news count, exit the loop
    if news_count >= max_news_count:
        break

# Save the data to a CSV file
csv_filename = "news_data.csv"
with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
    csv_writer = csv.writer(csvfile)
    
    # Write the header row
    csv_writer.writerow(["News Title", "News Lead", "Story Tags"])
    
    # Write the data
    csv_writer.writerows(data)

print(f"Data saved to {csv_filename}")



Data saved to news_data.csv
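
One possible next step toward the research question in Question 1 is to count how often each story tag appears in the collected rows. The sketch below is only an illustration: the helper name `count_story_tags` and the sample rows are hypothetical, and it assumes rows shaped like the `[title, lead, tags]` lists written above, with tags joined by ", " and the placeholder "Story Tags Not Available" marking items without tags.

```python
from collections import Counter

# Placeholder string used above when a news item has no tags
MISSING_TAGS = "Story Tags Not Available"


def count_story_tags(rows):
    """Count tag frequencies in rows shaped like [title, lead, tags]."""
    counts = Counter()
    for _title, _lead, tags in rows:
        if tags == MISSING_TAGS:
            continue  # skip items that had no tags
        for tag in tags.split(","):
            tag = tag.strip().lower()  # normalize spacing and case
            if tag:
                counts[tag] += 1
    return counts


# Hypothetical sample rows, made up for illustration only
sample_rows = [
    ["Title A", "Lead A", "Engineering, Grants"],
    ["Title B", "Lead B", "Story Tags Not Available"],
    ["Title C", "Lead C", "Grants"],
]

tag_counts = count_story_tags(sample_rows)
for tag, n in tag_counts.most_common(5):
    print(tag, n)
```

With the real data, the rows could be read back from `news_data.csv` with `csv.reader` (skipping the header row) and passed to the same function.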


Question 3 (10 points): Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "information retrieval". The articles should have been published in the last 10 years (2013-2023).

The following information of the article needs to be collected:

(1) Title

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [10]:
pip install selenium


In [49]:
import requests
from bs4 import BeautifulSoup
import csv

# Define the URL and headers
base_url = "https://dl.acm.org/action/doSearch"
headers = {
    "User-Agent": "Your User-Agent Header Here",
}

# Initialize variables
total_samples = 1000
samples_collected = 0
page_number = 1

# Initialize a list to store the data
article_data = []

while samples_collected < total_samples:
    # Define query parameters for the search
    params = {
        "AllField": "information retrieval",  # keyword required by the question
        "expand": "all",
        "ConceptID": "118230",
        "pageNumber": page_number,
    }

    # Send an HTTP GET request to the URL with query parameters
    response = requests.get(base_url, params=params, headers=headers)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')

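        # Find the heading elements that hold the article titles
        articles = soup.find_all('h5', class_='issue-item__title')

        # Stop if the page has no results; otherwise the loop would never end
        if not articles:
            break

        # Extract article titles (this selector yields titles only)
        for article in articles:
            if samples_collected >= total_samples:
                break
            title = article.text.strip()
            article_data.append({'Title': title})
            samples_collected += 1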

        # Increment the page number for pagination
        page_number += 1

        # Break the loop if the desired number of samples is reached
        if samples_collected >= total_samples:
            break
    else:
        print("Failed to retrieve the webpage. Status code:", response.status_code)
        break

# Save the data to a CSV file
with open('acm_articles.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['Title']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(article_data)

print(f"Scraped {samples_collected} articles and saved to acm_articles.csv")

Failed to retrieve the webpage. Status code: 403
Scraped 660 articles and saved to acm_articles.csv
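
The run above stopped with a 403 response and collected only titles. One alternative among the sources the question allows is the Semantic Scholar Graph API, which returns JSON instead of HTML. The following is a rough sketch, not a tested collector: the endpoint, the `fields` and `year` parameters, and the page size of 100 reflect my understanding of the public API documentation and should be checked there. The helper names `build_search_url`, `paper_to_row`, and `fetch_page` are hypothetical, and unauthenticated requests may be rate limited.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the Semantic Scholar Graph API paper search
API_URL = "https://api.semanticscholar.org/graph/v1/paper/search"
FIELDS = "title,venue,year,authors,abstract"


def build_search_url(query, offset, limit=100, year_range="2013-2023"):
    """Build the search URL for one page of results."""
    params = {
        "query": query,
        "fields": FIELDS,
        "year": year_range,
        "offset": offset,
        "limit": limit,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def paper_to_row(paper):
    """Flatten one paper record (a dict) into the five required columns."""
    authors = "; ".join(a.get("name", "") for a in paper.get("authors") or [])
    return {
        "Title": paper.get("title") or "",
        "Venue": paper.get("venue") or "",
        "Year": paper.get("year") or "",
        "Authors": authors,
        "Abstract": paper.get("abstract") or "",
    }


def fetch_page(query, offset):
    """Download one page of results; not called here because it needs network access."""
    with urllib.request.urlopen(build_search_url(query, offset)) as resp:
        payload = json.load(resp)
    return [paper_to_row(p) for p in payload.get("data", [])]


# Offline demonstration with a made-up record shaped like an API result
example = {
    "title": "An Example Paper",
    "venue": "Example Conference",
    "year": 2020,
    "authors": [{"name": "A. Author"}, {"name": "B. Author"}],
    "abstract": None,
}
row = paper_to_row(example)
url = build_search_url("information retrieval", offset=0)
print(row)
print(url)
```

Calling `fetch_page("information retrieval", offset)` for offsets 0, 100, and so on up to 900, with a short pause between requests, would gather up to 1000 rows if the API permits that many results; the rows could then be written with `csv.DictWriter` as in the cell above.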


Question 4 (10 points): Write Python code to collect 1000 posts from Twitter, Facebook, or Instagram. You can use hashtags, keywords, user_name, user_id, or other information to collect the data.

The following information needs to be collected:

(1) User_name

(2) Posted time

(3) Text

In [51]:
# Your code here (Please add comments in the code):

import instaloader

# Initialize Instaloader
L = instaloader.Instaloader()

# Define the target Instagram account (user_name or user_id)
target_account = "unt"

# Load the profile of the target account
profile = instaloader.Profile.from_username(L.context, target_account)

# Initialize a list to store the collected data
posts_data = []

# Collect posts
for post in profile.get_posts():
    # Get user_name, posted_time, and text
    user_name = post.owner_username
    posted_time = post.date
    text = post.caption if post.caption else ""

    # Append the data to the list
    posts_data.append([user_name, posted_time, text])

    # Break the loop once 1000 posts are collected
    if len(posts_data) >= 1000:
        break

# Save the collected data to a CSV file
import csv

csv_file_name = 'instagram_posts.csv'
with open(csv_file_name, 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['User Name', 'Posted Time', 'Text'])
    csv_writer.writerows(posts_data)
    
print(f'Collected and saved {len(posts_data)} Instagram posts from @{target_account} to {csv_file_name}.')


Collected and saved 1000 Instagram posts from @unt to instagram_posts.csv.
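
As a small variation on the loop above, `itertools.islice` can cap the number of posts without a manual counter, and `datetime.isoformat()` gives the posted time an unambiguous text form in the CSV. This is a sketch under assumptions: the helper `collect_post_rows` and the `FakePost` stand-in are hypothetical and only mimic the three attributes read above (`owner_username`, `date`, `caption`), so the example runs without contacting Instagram.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import islice
from typing import Iterable, List, Optional


@dataclass
class FakePost:
    """Hypothetical stand-in exposing the attributes read in the cell above."""
    owner_username: str
    date: datetime
    caption: Optional[str]


def collect_post_rows(posts: Iterable, limit: int) -> List[list]:
    """Return at most `limit` rows of [user_name, posted_time, text]."""
    rows = []
    for post in islice(posts, limit):
        rows.append([
            post.owner_username,
            post.date.isoformat(),  # ISO 8601 text, e.g. 2023-09-13T10:00:00
            post.caption or "",     # captions can be None
        ])
    return rows


# Made-up posts for illustration only; odd days have no caption
fake_posts = (
    FakePost("unt", datetime(2023, 9, day, 10, 0, 0), None if day % 2 else f"Post {day}")
    for day in range(1, 6)
)
rows = collect_post_rows(fake_posts, limit=3)
print(rows)
```

With a real profile, `collect_post_rows(profile.get_posts(), 1000)` would take the place of the explicit loop and the `break`.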
