## The Second In-Class Exercise (09/13/2023, 40 points in total)

Kindly use the provided .ipynb document to write your code or respond to the questions. Avoid generating a new file.
Execute all the cells before your final submission.

This in-class exercise is due tomorrow, September 14, 2023, at 11:59 PM. No late submissions will be considered.

The purpose of this exercise is to understand users' information needs, then collect data from different sources for analysis.

Question 1 (10 points): Describe an interesting research question (or a practical question, or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? How much data is needed for the analysis? Describe the detailed steps for collecting and saving the data.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the questions above):


Research Question: "How does the adoption of renewable energy sources impact the energy efficiency and environmental sustainability of urban areas, and what are the economic implications of such a transition?"

Data Collection:

To answer this research question, a multifaceted approach to data collection is required, encompassing various types of data from different sources:

1. **Energy Consumption Data:**
   - Collect historical energy consumption data for urban areas of interest.
   - Data sources can include utility companies, government records, and smart meters.
   - Data should cover a multi-year period to capture trends.

2. **Renewable Energy Adoption Data:**
   - Gather data on the installation and utilization of renewable energy sources, such as solar panels and wind turbines.
   - This data can come from government records, energy companies, and renewable energy organizations.

3. **Environmental Data:**
   - Collect data on air quality, greenhouse gas emissions, and other environmental indicators.
   - Utilize government environmental agencies' data and conduct field measurements.
   - Qualitative data can be gathered through surveys and interviews regarding environmental perceptions.

4. **Economic Data:**
   - Retrieve economic data related to the costs and benefits of renewable energy adoption.
   - This can include data on subsidies, tax incentives, and economic growth.
   - Economic modeling may also be needed to estimate long-term economic impacts.


Data Quantity:

The quantity of data required depends on the scope of the study and the level of detail needed for analysis. For a comprehensive analysis, several years of historical data for energy consumption and environmental indicators are typically necessary. Additionally, data should cover a diverse set of urban areas to ensure representativeness. It is recommended to collect data from at least 100 urban areas for robust statistical analysis.
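
As a rough sizing sketch, the number of records grows with the number of areas, the years covered, and how often each area is observed. The 100 urban areas come from the text above; the five-year span and monthly observations are assumptions chosen only for illustration, and `estimate_record_count` is a hypothetical helper name.

```python
def estimate_record_count(num_areas, num_years, observations_per_year):
    """Return the number of records for one dataset covering every area and period."""
    return num_areas * num_years * observations_per_year


num_areas = 100             # recommended minimum from the text
num_years = 5               # assumption: "several years" taken as five
observations_per_year = 12  # assumption: one observation per month

total_records = estimate_record_count(num_areas, num_years, observations_per_year)
print(f"Estimated records per dataset: {total_records}")
```

Under these assumptions each dataset would hold 6,000 records; a coarser yearly granularity would reduce that to 500.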

Data Collection and Storage Steps:

Here's a high-level overview of the steps for collecting and saving the data:

1. **Energy and Environmental Data:**
   - Identify the sources of energy and environmental data.
   - Develop a data collection plan to ensure regular updates.
   - Store this data in a structured format, such as a database or CSV files.

2. **Renewable Energy Adoption Data:**
   - Obtain data from government agencies, energy companies, and research institutions.
   - Organize the data by location and time.
   - Store this data alongside energy consumption data for analysis.

3. **Economic Data:**
   - Retrieve economic data from government reports and databases.
   - Ensure data is available for the same urban areas and time periods as other datasets.
   - Store economic data in a separate database or spreadsheet.

4. **Public Perception Data:**
   - Design surveys and interview questionnaires.
   - Collect responses from residents in the selected urban areas.
   - Combine qualitative responses with quantitative sentiment analysis data.
   - Store survey data in a secure database.

5. **Data Security:**
   - Implement data security measures to protect sensitive information, including encryption and access controls.

Once all data is collected and stored securely, the analysis can proceed.
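
The storage steps above (structured storage, organizing by location and time, and keeping the same urban areas and time periods across datasets) could be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module, not a definitive design: the table names, column names, and every value in the rows are hypothetical placeholders, not real measurements.

```python
import sqlite3

# Hypothetical placeholder rows (NOT real data), keyed by city and year.
energy_rows = [("City A", 2021, 1200.0), ("City A", 2022, 1150.0), ("City B", 2022, 900.0)]
economic_rows = [("City A", 2022, 3.5), ("City B", 2021, 1.0), ("City B", 2022, 1.2)]


def build_database(energy, economic):
    """Create an in-memory database with one table per dataset."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE energy (city TEXT, year INTEGER, consumption_gwh REAL, "
        "PRIMARY KEY (city, year))"
    )
    conn.execute(
        "CREATE TABLE economic (city TEXT, year INTEGER, subsidy_musd REAL, "
        "PRIMARY KEY (city, year))"
    )
    conn.executemany("INSERT INTO energy VALUES (?, ?, ?)", energy)
    conn.executemany("INSERT INTO economic VALUES (?, ?, ?)", economic)
    conn.commit()
    return conn


def aligned_records(conn):
    """Return only the (city, year) pairs present in both datasets."""
    query = """
        SELECT e.city, e.year, e.consumption_gwh, c.subsidy_musd
        FROM energy AS e
        JOIN economic AS c ON c.city = e.city AND c.year = e.year
        ORDER BY e.city, e.year
    """
    return conn.execute(query).fetchall()


conn = build_database(energy_rows, economic_rows)
merged = aligned_records(conn)
for row in merged:
    print(row)
```

Using `(city, year)` as the key in every table makes it easy to check that the datasets cover the same areas and periods: rows missing from either side simply drop out of the join.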

Question 2 (10 points): Write Python code to collect 1000 samples of the data you discussed above.

In [9]:
pip install requests beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [20]:
import requests
from bs4 import BeautifulSoup
import csv

# Source page: the U.S. Energy Information Administration (EIA) consumption data page
url = 'https://www.eia.gov/consumption/data.php'
# Send an HTTP GET request to the URL (with a timeout so the cell cannot hang)
response = requests.get(url, timeout=30)
# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Find all paragraphs on the page
    paragraphs = soup.find_all('p')
    # Collect up to 1000 data samples (a single page may contain fewer paragraphs)
    data_samples = []
    for paragraph in paragraphs:
        # Extract the text from the paragraph
        data = paragraph.get_text().strip()
        # Skip empty paragraphs so they are not counted as samples
        if not data:
            continue
        # Add the data to the list of samples
        data_samples.append(data)
        # Stop once 1000 samples have been collected
        if len(data_samples) >= 1000:
            break
    # Saving data samples to a CSV file
    with open('data_samples.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['Data Sample'])
        for sample in data_samples:
            writer.writerow([sample])

    print('Data samples saved to data_samples.csv')

else:
    print(f'Failed to retrieve data. Status code: {response.status_code}')

Data samples saved to data_samples.csv


Question 3 (10 points): Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "information retrieval". The articles should have been published in the last 10 years (2013-2023).

The following information of the article needs to be collected:

(1) Title

(2) Venue/journal/conference where the article was published

(3) Year

(4) Authors

(5) Abstract

In [22]:
import requests
from bs4 import BeautifulSoup
import csv

# Variable initialization
articles = []
keyword = "information retrieval"
years_to_check = 10
current_year = 2023

# Loop through the years, newest first (2023 down to 2014)
for year in range(current_year, current_year - years_to_check, -1):
    # Construct the URL for Google Scholar with the specified year and keyword
    url = f"https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q={keyword}&as_ylo={year}&as_yhi={year}"
    # Send an HTTP GET request (only the first results page is fetched for each year)
    response = requests.get(url, timeout=30)
    # Check if the request was successful (status code 200)
    if response.status_code == 200:
        # Parse the HTML content of the page using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        # Find all search result elements
        results = soup.find_all('div', class_='gs_ri')
        # Extract information from each search result
        for result in results:
            # Prefer the link text; fall back to the heading when a result has no link
            title_element = result.find('h3', class_='gs_rt')
            title = (title_element.a or title_element).get_text().strip()
            # The byline usually reads "Authors - Venue, Year - Publisher"
            byline = result.find('div', class_='gs_a').text.replace('\xa0', ' ').strip()
            byline_parts = [part.strip() for part in byline.split(' - ')]
            authors = byline_parts[0]
            # Heuristic: the middle part holds the venue followed by the year
            venue_and_year = byline_parts[1] if len(byline_parts) > 1 else ''
            venue = venue_and_year.rsplit(',', 1)[0].strip() if ',' in venue_and_year else ''
            # Google Scholar shows only a short snippet of the abstract, if any
            abstract_element = result.find('div', class_='gs_rs')
            abstract = abstract_element.text.strip() if abstract_element else ''
            # Append the article data to the list; the year is known from the query
            articles.append({
                'Title': title,
                'Venue': venue,
                'Year': year,
                'Authors': authors,
                'Abstract': abstract
            })

# Save the articles to a CSV file
csv_filename = 'scholar_articles.csv'
with open(csv_filename, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ['Title', 'Venue', 'Year', 'Authors', 'Abstract']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()  # Write the CSV header row
    for article in articles:
        writer.writerow(article)
print(f'{len(articles)} articles saved to {csv_filename}')

100 articles saved to scholar_articles.csv
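
The run above saved 100 articles rather than 1000, because each yearly query reads only the first page of results. One possible way to reach more results is to request further pages per year. The sketch below only builds the page URLs and makes no network requests. It assumes that Google Scholar pages results through a `start` offset in steps of 10, and `build_page_urls` is a hypothetical helper name. Google Scholar may throttle or block automated requests, so treat this as an illustration rather than a guaranteed approach.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"
RESULTS_PER_PAGE = 10  # assumption: default number of results per page


def build_page_urls(keyword, year, pages):
    """Build one search URL per results page for a single publication year."""
    urls = []
    for page in range(pages):
        params = {
            "hl": "en",
            "q": keyword,
            "as_ylo": year,  # earliest publication year
            "as_yhi": year,  # latest publication year
            "start": page * RESULTS_PER_PAGE,  # offset of the first result on this page
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


page_urls = build_page_urls("information retrieval", 2023, pages=3)
for page_url in page_urls:
    print(page_url)
```

With 10 pages for each of 10 years, this scheme would yield 100 URLs and, at 10 results per page, up to 1000 articles. Pausing between requests (for example with `time.sleep`) would be a sensible courtesy to the server.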


Complete either one of the two Question 4 tasks below.

Question 4 (10 points): Write Python code to collect 1000 posts from Twitter, Facebook, or Instagram. You can use hashtags, keywords, user_name, user_id, or other information to collect the data.

The following information needs to be collected:

(1) User_name

(2) Posted time

(3) Text

In [None]:
# Your code here (please add comments in the code):


Question 4 (10 points):

In this task, you are required to identify and utilize online tools for web scraping data from websites without the need for coding, with a specific focus on Parsehub. The objective is to gather data and save it in formats like CSV, Excel, or any other suitable file format.

Include an introduction to whichever tool you choose, the steps followed for web scraping, and the final output of the data collected.

Upload a document (Word or PDF file) to the same repository and add the link in the .ipynb file.

In [None]:
# https://github.com/yaminiravala/5731/blob/main/Exercise-2%20Q-4.docx