## The second in-class exercise (09/13/2023, 40 points in total)

Kindly use the provided .ipynb document to write your code or respond to the questions. Avoid generating a new file.
Execute all the cells before your final submission.

This in-class exercise is due tomorrow, September 14, 2023, at 11:59 PM. No late submissions will be considered.

The purpose of this exercise is to understand users' information needs, then collect data from different sources for analysis.

Question 1 (10 points): Describe an interesting research question (or practical question, or something innovative) you have in mind. What kind of data should be collected to answer the question(s)? How much data is needed for the analysis? Give the detailed steps for collecting and saving the data.


#### Research Question:
Are there any trends or correlations between the choice of domain registrars by Fortune 1000 companies and their respective industries or financial performance?

#### Data Needed for the Analysis:
Domain Registrar Data: You already have data on which domain registrars are used by Fortune 1000 companies. This data includes the name of the company, its domain, and the registrar.

Industry Classification: You would need information about the industries to which each of the Fortune 1000 companies belongs. This information can help you understand if there are industry-specific trends in the choice of domain registrars. You can obtain industry classification data from sources like the Global Industry Classification Standard (GICS).

Financial Performance Data: To assess if there are correlations between registrar choice and financial performance, you would need financial data for each of the Fortune 1000 companies. This could include metrics like revenue, profit margins, stock performance, and market capitalization. Financial data can be obtained from financial databases like Bloomberg, Yahoo Finance, or directly from the companies' annual reports.

Historical Data: Collect historical data on registrar choices over multiple years if you want to study trends over time. This could involve periodically scraping the registrar data or accessing historical records if available.

#### Steps for Collecting and Saving the Data:

Collect and Prepare Registrar Data: You've already collected the registrar data. Ensure that it is cleaned and formatted correctly.

Collect Industry Classification Data: Obtain industry classification data for the Fortune 1000 companies. This may require web scraping, API access to financial data providers, or manual data entry.

Collect Financial Performance Data: Choose relevant financial metrics for analysis (e.g., revenue, profit margin, stock price). Access financial databases or company annual reports to collect this data. Ensure the data is cleaned and consistent.

Combine Data: Merge the registrar data, industry classification data, and financial performance data into a single dataset. This dataset should have a common identifier (e.g., company name or ticker symbol) that allows you to link the information.

Data Analysis: Perform statistical and correlation analyses to identify any patterns or relationships between registrar choice, industry classification, and financial performance.

Visualization: Create visualizations (e.g., scatter plots, bar charts) to present your findings effectively.

Hypothesis Testing: Formulate hypotheses to test specific relationships (e.g., Does the choice of registrar impact a company's stock performance?). Use appropriate statistical tests to evaluate these hypotheses.

Save Data: Save the cleaned and merged dataset for future reference and analysis.

Report and Interpretation: Summarize your findings in a report or presentation. Interpret the results and provide insights into the relationships you've discovered.
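The combine and save steps above can be sketched with the standard library alone. This is a minimal illustration, not the actual pipeline: the registrar values are taken from the scraped list, while the industry labels (`Industry A`, `Industry B`, `Industry C`) and the helper names `company_key` and `merge_on_company` are hypothetical placeholders introduced only for this example.

```python
import csv
import io
from collections import Counter

# Registrar rows in the shape produced by the scraper (name, registrar).
registrar_rows = [
    {"Name": "Wal-Mart Stores", "Registrar": "NETWORK SOLUTIONS, INC."},
    {"Name": "Exxon Mobil", "Registrar": "REGISTER.COM, INC."},
    {"Name": "General Motors", "Registrar": "TUCOWS, INC."},
    {"Name": "Ford Motor", "Registrar": "NETWORK SOLUTIONS, INC."},
]

# Hypothetical industry labels: placeholders, not real classification data.
# The names deliberately differ in case and spacing from the registrar rows.
industry_rows = [
    {"Name": "wal-mart stores", "Industry": "Industry A"},
    {"Name": "Exxon Mobil ", "Industry": "Industry B"},
    {"Name": "General Motors", "Industry": "Industry C"},
    {"Name": "Ford Motor", "Industry": "Industry C"},
]


def company_key(name):
    """Normalize a company name so both sources share one identifier."""
    return " ".join(name.lower().split())


def merge_on_company(left_rows, right_rows):
    """Inner-join two lists of dicts on the normalized company name."""
    right_by_key = {company_key(r["Name"]): r for r in right_rows}
    merged = []
    for row in left_rows:
        match = right_by_key.get(company_key(row["Name"]))
        if match is not None:
            merged.append({**row, "Industry": match["Industry"]})
    return merged


merged = merge_on_company(registrar_rows, industry_rows)

# Contingency counts: how often each (industry, registrar) pair occurs.
pair_counts = Counter((row["Industry"], row["Registrar"]) for row in merged)

# Save the merged dataset as CSV text (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Name", "Registrar", "Industry"])
writer.writeheader()
writer.writerows(merged)
csv_text = buffer.getvalue()
```

Counts like `pair_counts` are the kind of table a test of independence between registrar and industry would start from. With real data, a ticker symbol would likely be a more reliable join key than a normalized name.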

Question 2 (10 points): Write Python code to collect 1000 data samples you discussed above.

In [29]:
pwd()

'/Users/shyamsonu/Downloads'

In [12]:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL to scrape
url = "https://cyber.harvard.edu/archived_content/people/edelman/fortune-registrars/fortune-list.html"

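# Send a GET request to the website and retrieve the HTML content
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
content = response.content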

# Parsing the HTML content using BeautifulSoup
soup = BeautifulSoup(content, "html.parser")

# Finding the table containing the data
table = soup.find("table")

# Checking if a table was found
if table:
    # Creating empty lists to store data
    ranks = []
    names = []
    domains = []
    registrars = []

    # Loop through the table rows (skipping the header) and extract each data point
    for row in table.find_all("tr")[1:]:
        columns = row.find_all("td")
        if len(columns) < 4:
            continue  # skip spacer or malformed rows that lack all four cells
        rank = columns[0].text.strip()
        name = columns[1].text.strip()
        domain = columns[2].text.strip()
        registrar = columns[3].text.strip()

        ranks.append(rank)
        names.append(name)
        domains.append(domain)
        registrars.append(registrar)

    # Creating a pandas DataFrame to store the data
    data = {
        "Rank": ranks,
        "Name": names,
        "Domain": domains,
        "Registrar": registrars
    }

    df = pd.DataFrame(data)

    # Print up to 1000 rows of the DataFrame (pandas abbreviates long output)
    print(df.head(1000))
    print("")
    print("Shape",df.shape)

else:
    print("Table not found on the webpage.")




     Rank                         Name             Domain   
0       1              Wal-Mart Stores  walmartstores.com  \
1       2                  Exxon Mobil     exxonmobil.com   
2       3               General Motors             gm.com   
3       4                   Ford Motor           ford.com   
4       5                        Enron          enron.com   
..    ...                          ...                ...   
995   996       Daisytek International       daisytek.com   
996   997                   Timberland     timberland.com   
997   998  American Management Systems            ams.com   
998   999                    C.R. Bard         crbard.com   
999  1000                PC Connection   pcconnection.com   

                             Registrar  
0              NETWORK SOLUTIONS, INC.  
1                   REGISTER.COM, INC.  
2                         TUCOWS, INC.  
3              NETWORK SOLUTIONS, INC.  
4                   REGISTER.COM, INC.  
..                   
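As a first look at the research question, registrar frequencies can be tallied directly from the scraped rows. The sketch below uses only the five rows visible at the top of the printed output above. The helper `normalize_registrar` is a hypothetical name, and the assumption that registrar spellings may vary in case or spacing has not been checked against the full list.

```python
from collections import Counter

# First five (name, registrar) pairs from the printed output above.
rows = [
    ("Wal-Mart Stores", "NETWORK SOLUTIONS, INC."),
    ("Exxon Mobil", "REGISTER.COM, INC."),
    ("General Motors", "TUCOWS, INC."),
    ("Ford Motor", "NETWORK SOLUTIONS, INC."),
    ("Enron", "REGISTER.COM, INC."),
]


def normalize_registrar(raw):
    """Collapse whitespace and case so spelling variants count as one registrar."""
    return " ".join(raw.upper().split())


# Count how many companies use each registrar.
registrar_counts = Counter(normalize_registrar(registrar) for _, registrar in rows)

for registrar, count in registrar_counts.most_common():
    print(f"{registrar}: {count}")
```

Applied to the full DataFrame, the same tally would show which registrars dominate before any industry or financial data is joined in.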

Question 3 (10 points): Write Python code to collect 1000 articles from Google Scholar (https://scholar.google.com/), Microsoft Academic (https://academic.microsoft.com/home), CiteSeerX (https://citeseerx.ist.psu.edu/index), Semantic Scholar (https://www.semanticscholar.org/), or the ACM Digital Library (https://dl.acm.org/) with the keyword "information retrieval". The articles should have been published in the last 10 years (2013-2023).

The following information about each article needs to be collected:

(1) Title

(2) Venue/journal/conference where it was published

(3) Year

(4) Authors

(5) Abstract

In [1]:
# Your code here (Please add comments in the code):


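import time

import requests
from bs4 import BeautifulSoup


def scrape_google_scholar(query, num_articles=1000):
    """Collect result entries from Google Scholar search pages for `query`."""
    base_url = "https://scholar.google.com/scholar"
    params = {
        "q": query,
        "as_ylo": 2013,  # earliest publication year
        "as_yhi": 2023,  # latest publication year
        "hl": "en",
        "as_sdt": "0,5",
    }

    articles = []

    while len(articles) < num_articles:
        # "start" is the offset of the first result on the requested page
        params["start"] = len(articles)
        response = requests.get(base_url, params=params, timeout=30)

        # Stop on any non-200 reply (e.g. when requests are blocked);
        # without this check the loop would never terminate
        if response.status_code != 200:
            break

        soup = BeautifulSoup(response.text, "html.parser")
        results = soup.find_all("div", class_="gs_ri")

        # No more results: stop paging
        if not results:
            break

        collected_before = len(articles)

        for result in results:
            title_tag = result.find("h3", class_="gs_rt")
            byline_tag = result.find("div", class_="gs_a")
            abstract_tag = result.find("div", class_="gs_rs")
            if title_tag is None or byline_tag is None:
                continue  # skip entries that lack a title or byline

            # The byline holds authors, venue, year, and source in one string
            byline = byline_tag.text
            articles.append({
                "Title": title_tag.text,
                "Venue": byline,  # the full byline text
                "Year": byline.split(" - ")[-1],  # last " - " segment of the byline
                "Authors": byline.split(" - ")[0],  # first " - " segment of the byline
                "Abstract": abstract_tag.text if abstract_tag else "",
            })

        # Avoid requesting the same page forever if nothing usable was found
        if len(articles) == collected_before:
            break

        time.sleep(1)  # pause briefly between page requests

    return articles


# Example usage
keyword = "information retrieval"
num_articles_to_collect = 1000

articles = scrape_google_scholar(keyword, num_articles=num_articles_to_collect)

# Print every collected article
for index, article in enumerate(articles, start=1):
    print(f"Article {index}:")
    print("Title:", article["Title"])
    print("Venue:", article["Venue"])
    print("Year:", article["Year"])
    print("Authors:", article["Authors"])
    print("Abstract:", article["Abstract"])
    print("\n")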

Article 1:
Title: Information retrieval as statistical translation
Venue: A Berger, J Lafferty - ACM SIGIR Forum, 2017 - dl.acm.org
Year: dl.acm.org
Authors: A Berger, J Lafferty - ACM SIGIR Forum, 2017
Abstract: … There is a large literature on probabilistic approaches to information retrieval, and we will 
not attempt to survey it here. Instead, we focus on the language modeling approach introduced …


Article 2:
Title: [BOOK][B] Information retrieval: Implementing and evaluating search engines
Venue: S Buttcher, CLA Clarke, GV Cormack - 2016 - books.google.com
Year: books.google.com
Authors: S Buttcher, CLA Clarke, GV Cormack
Abstract: … Information retrieval forms the foundation for modern search engines. In this textbook we 
provide an introduction to information retrieval targeted at graduate students and working …


Article 3:
Title: A language modeling approach to information retrieval
Venue: JM Ponte, WB Croft - ACM SIGIR Forum, 2017 - dl.acm.org
Year: dl.acm.org
Authors: JM P

tools for information extraction, data pattern recognition and predictions. From the perspective …


Article 83:
Title: Distributed representations of sentences and documents
Venue: Q Le, T Mikolov - International conference on machine …, 2014 - proceedings.mlr.press
Year: proceedings.mlr.press
Authors: Q Le, T Mikolov - International conference on machine …, 2014
Abstract: … We also test our method on an information retrieval task, where the goal is to decide if a … 
Information Retrieval with Paragraph Vectors We turn our attention to an information retrieval …


Article 84:
Title: [PDF][PDF] Event extraction via dynamic multi-pooling convolutional neural networks
Venue: Y Chen, L Xu, K Liu, D Zeng, J Zhao - Proceedings of the 53rd …, 2015 - aclanthology.org
Year: aclanthology.org
Authors: Y Chen, L Xu, K Liu, D Zeng, J Zhao - Proceedings of the 53rd …, 2015
Abstract: … important information to represent the sentence, we may obtain the information that depicts 
“… In this paper, we f

Year: aclanthology.org
Authors: D Zeng, K Liu, Y Chen, J Zhao - Proceedings of the 2015 …, 2015
Abstract: … that attempt to model such structural information. These approaches usually consider both 
… to capture such structural information. To capture structural and other latent information, we …


Article 125:
Title: Accurately interpreting clickthrough data as implicit feedback
Venue: T Joachims, L Granka, B Pan, H Hembrooke, G Gay - Acm Sigir Forum, 2017 - dl.acm.org
Year: dl.acm.org
Authors: T Joachims, L Granka, B Pan, H Hembrooke, G Gay - Acm Sigir Forum, 2017
Abstract: … To the best of our knowledge, very few studies have used eye-tracking in the context of 
online information retrieval, and none have addressed the issues detailed in this present paper. …


Article 126:
Title: Teaching machines to read and comprehend
Venue: KM Hermann, T Kocisky… - … neural information …, 2015 - proceedings.neurips.cc
Year: proceedings.neurips.cc
Authors: KM Hermann, T Kocisky… - … neural inform

Title: Deep sets
Venue: M Zaheer, S Kottur, S Ravanbakhsh… - … neural information …, 2017 - proceedings.neurips.cc
Year: proceedings.neurips.cc
Authors: M Zaheer, S Kottur, S Ravanbakhsh… - … neural information …, 2017
Abstract: … It is an important task due to wide range of potential applications including personalized 
information retrieval, computational advertisement, tagging large amounts of unlabeled or …


Article 167:
Title: Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)
Venue: Y Sun, L Zheng, Y Yang, Q Tian… - Proceedings of the …, 2018 - openaccess.thecvf.com
Year: openaccess.thecvf.com
Authors: Y Sun, L Zheng, Y Yang, Q Tian… - Proceedings of the …, 2018
Abstract: … Moreover, for learning the part classifier without labeling information, we compare RPP with 
another potential method derived from Mid-level Element Mining [22,27,8]. Specifically, we …


Article 168:
Title: Updated methodological guidance for the conduct of 

Abstract: … communication intention and emotion information. Then, we … performance of feature-based 
retrieval approach. We adopt sev… examine reranking-enhanced retrieval approaches, which …


Article 208:
Title: Road extraction by deep residual u-net
Venue: Z Zhang, Q Liu, Y Wang - IEEE Geoscience and Remote …, 2018 - ieeexplore.ieee.org
Year: ieeexplore.ieee.org
Authors: Z Zhang, Q Liu, Y Wang - IEEE Geoscience and Remote …, 2018
Abstract: … in the past decade, road extraction from high-resolution remote … : road area extraction and 
road centerline extraction. Road area … information our network regard them as backgrounds. …


Article 209:
Title: YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia
Venue: J Hoffart, FM Suchanek, K Berberich, G Weikum - Artificial intelligence, 2013 - Elsevier
Year: Elsevier
Authors: J Hoffart, FM Suchanek, K Berberich, G Weikum - Artificial intelligence, 2013
Abstract: … of Wikipedia and algorithmic advances in information extr

Article 291:
Title: Linked data-the story so far
Venue: C Bizer, T Heath, T Berners-Lee - Linking the World's Information …, 2023 - dl.acm.org
Year: dl.acm.org
Authors: C Bizer, T Heath, T Berners-Lee - Linking the World's Information …, 2023
Abstract: 8.1 The World Wide Web has radically altered the way we share knowledge, by lowering 
the barrier to publishing and accessing documents as part of a global information space. …


Article 292:
Title: Spectral–spatial feature extraction for hyperspectral image classification: A dimension reduction and deep learning approach
Venue: W Zhao, S Du - IEEE Transactions on Geoscience and Remote …, 2016 - ieeexplore.ieee.org
Year: ieeexplore.ieee.org
Authors: W Zhao, S Du - IEEE Transactions on Geoscience and Remote …, 2016
Abstract: … for spectral and spatial feature extraction, respectively. In this framework, a balanced 
local discriminant embedding algorithm is proposed for spectral feature extraction from high-…


Article 293:
Title: Unsuperv

Authors: XX Zhu, D Tuia, L Mou, GS Xia, L Zhang… - … and remote sensing …, 2017
Abstract: … Already, prior to a joint information extraction, a crucial step involves developing novel archi… 
of pixel information with other sources of data, such as geographic information system layers, …


Article 333:
Title: [BOOK][B] Foundations of decision support systems
Venue: RH Bonczek, CW Holsapple, AB Whinston - 2014 - books.google.com
Year: books.google.com
Authors: RH Bonczek, CW Holsapple, AB Whinston
Abstract: … taken as meaning a plan for information processing that involves some transformation of 
information (and not merely the storage, retrieval, or display of information). A model may be …


Article 334:
Title: Infrared and visible image fusion methods and applications: A survey
Venue: J Ma, Y Ma, C Li - Information fusion, 2019 - Elsevier
Year: Elsevier
Authors: J Ma, Y Ma, C Li - Information fusion, 2019
Abstract: … The keys to an excellent fusion method are effective image informati


Article 416:
Title: [BOOK][B] Information hiding
Venue: S Katzenbeisser, F Petitcolas - 2016 - books.google.com
Year: books.google.com
Authors: S Katzenbeisser, F Petitcolas
Abstract: … recipient can finally extract the secret message from the stego-object; this can typically be 
done without having access to the original unmodified cover—we speak of a blind extraction …


Article 417:
Title: Auto-weighted multi-view clustering via kernelized graph learning
Venue: S Huang, Z Kang, IW Tsang, Z Xu - Pattern Recognition, 2019 - Elsevier
Year: Elsevier
Authors: S Huang, Z Kang, IW Tsang, Z Xu - Pattern Recognition, 2019
Abstract: … Nowadays, more and more datasets are represented by different views where the encoded 
information is complementary to each other. Thus it is critical to designing algorithms to fuse …


Article 418:
Title: Fine-tuning CNN image retrieval with no human annotation
Venue: F Radenović, G Tolias, O Chum - IEEE transactions on pattern …, 2018 - ieeexplore.ieee.org
Y

Venue: J Bengtsson‐Palme, M Ryberg… - Methods in ecology …, 2013 - Wiley Online Library
Year: Wiley Online Library
Authors: J Bengtsson‐Palme, M Ryberg… - Methods in ecology …, 2013
Abstract: … default settings, and the extraction efficiency was examined. … We ask the users to examine 
any cases where the extraction … software utility for robust extraction of the components of the …


Article 524:
Title: The contextual brain: implications for fear conditioning, extinction and psychopathology
Venue: S Maren, KL Phan, I Liberzon - Nature reviews neuroscience, 2013 - nature.com
Year: nature.com
Authors: S Maren, KL Phan, I Liberzon - Nature reviews neuroscience, 2013
Abstract: … As such, contexts enable the flexible representation and retrieval of information and have a 
… contextual information. Indeed, an inability to appropriately contextualize information may …


Article 525:
Title: [BOOK][B] Parallel models of associative memory: updated edition
Venue: GE Hinton, JA Anderson - 2014 -

Year: books.google.com
Authors: LR Rudnick
Abstract: … This book contains information obtained from authentic and highly regarded sources. 
Reasonable efforts have been made to publish reliable data and information, but the author and …


Article 640:
Title: [HTML][HTML] Machine learning and deep learning
Venue: C Janiesch, P Zschech, K Heinrich - Electronic Markets, 2021 - Springer
Year: Springer
Authors: C Janiesch, P Zschech, K Heinrich - Electronic Markets, 2021
Abstract: … Inspired by the principle of information processing in … extraction, model building, and model 
assessment of shallow ML and DL (cf. Figure 2). With explicit programming, feature extraction …


Article 641:
Title: [BOOK][B] Criminal behavior systems: A typology
Venue: M Clinard, R Quinney, J Wildeman - 2014 - books.google.com
Year: books.google.com
Authors: M Clinard, R Quinney, J Wildeman
Abstract: … other means, now known or hereafter invented, including photocopying and recording, or in 
any information stora
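In the output above, the `Year` field holds the hosting site (e.g. `dl.acm.org`) and the `Venue` field holds the whole byline, because the byline packs authors, venue, year, and source into one string. A minimal sketch of splitting it into separate fields follows. The helper name `parse_byline` is hypothetical. The non-breaking-space handling is an inference from the `Authors` lines above, where some entries were not split at the first dash, and has not been verified against the page markup.

```python
import re


def parse_byline(byline):
    """Split a Scholar byline into authors, venue, year, and source."""
    # Treat non-breaking spaces as ordinary spaces before splitting.
    parts = [p.strip() for p in byline.replace("\xa0", " ").split(" - ")]
    authors = parts[0]
    source = parts[-1] if len(parts) > 1 else ""
    # Everything between the first and last segment is "venue, year" or just "year".
    middle = " - ".join(parts[1:-1]) if len(parts) > 2 else ""
    # Anchor the year at the end so a year inside the venue name is not picked up.
    year_match = re.search(r"(19|20)\d{2}$", middle)
    year = int(year_match.group(0)) if year_match else None
    venue = middle[:year_match.start()].rstrip(" ,") if year_match else middle
    return {"Authors": authors, "Venue": venue, "Year": year, "Source": source}


article = parse_byline("A Berger, J Lafferty - ACM SIGIR Forum, 2017 - dl.acm.org")
book = parse_byline("S Buttcher, CLA Clarke, GV Cormack - 2016 - books.google.com")
```

For the first byline this yields the venue `ACM SIGIR Forum` and the year `2017`. For the book entry the venue is empty, since the byline carries no venue. Venues that Scholar abbreviates with `…` stay abbreviated.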

Question 4 (10 points):

In this task, you are required to identify and utilize online tools for web scraping data from websites without the need for coding, with a specific focus on Parsehub. The objective is to gather data and save it in formats like CSV, Excel, or any other suitable file format.

Provide an introduction to whichever tool you prefer to use, the steps to follow for web scraping, and the final output of the data collected.

Upload a document (Word or PDF File) in the same repository and you can add the link in the ipynb file.

https://console.apify.com/actors/nFJndFXA5zjCTuudP/runs/0j4lYweOLH1TBITAK#output

https://github.com/shyamsundar0329/ShyamSundar_INFO5731_Fall2023/blob/main/In_class_exercise/Domakonda_Exercise_02.docx

 


Apify is a flexible web scraping and automation platform designed to simplify gathering information from websites. Its user-friendly interface and range of features let users scrape data, automate processes, and access web-based information quickly. Apify is especially useful for organizations, academics, and developers who need structured data from the web for purposes such as content aggregation, market research, and competitive analysis.

  

Steps to Perform Web Scraping with Apify: 

  

Create a Task: Begin by creating a new task in Apify. A task is essentially a set of instructions that define what data you want to scrape and how to extract it. You can either create a custom task or choose from a library of pre-built ones. 

  

Define Input Parameters: Specify input parameters for your task, such as the target website's URL, the data you want to collect (e.g., text, images, links), and any specific instructions for navigating the website (e.g., clicking buttons, filling out forms). 

  

Configure Scrapers: Apify provides a range of scraping tools, including web scraping actors and crawlers, that can be configured to extract data from websites. You can use CSS selectors or XPath expressions to pinpoint the data you want to scrape. 

  

Set Up Pagination: If your target website has multiple pages of data, configure pagination to ensure that all relevant pages are scraped. Apify allows you to handle pagination automatically. 

  

Run the Task: Execute your scraping task by running it on the Apify platform. Apify will start the process of fetching data from the specified website, following the instructions you provided. 

  

Data Extraction and Storage: As data is scraped, Apify stores it in a structured format, typically in JSON or CSV files. You can choose to store the data on Apify's cloud or download it for local use. 

 

 

 

The scraped output contains information about Google search results for the query "Python, Machine Learning, and Deep Learning". Here's a brief description of the columns in the output data:

  

1. `searchQuery/countryCode`: The country code for the search location (e.g., US). 

2. `searchQuery/device`: The type of device used for the search (e.g., DESKTOP). 

3. `searchQuery/domain`: The domain used for the search (e.g., google.com). 

4. `searchQuery/languageCode`: The language code used for the search. 

5. `searchQuery/locationUule`: Location UULE code for the search location. 

6. `searchQuery/page`: The page number of the search results. 

7. `searchQuery/resultsPerPage`: The number of results per page. 

8. `searchQuery/term`: The search term or query used (e.g., "Python, Machine Learning and Deep Learning"). 

9. `searchQuery/type`: The type of search query. 

10. `searchQuery/url`: The URL of the search query. 

11. `resultsTotal`: Total number of results for the search query. 

12. `description`: A brief description or snippet from the search result. 

13. `displayedUrl`: The displayed URL of the search result. 

14. `emphasizedKeywords/0`, `emphasizedKeywords/1`, `emphasizedKeywords/2`: Emphasized keywords in the search result. 

15. `position`: Position or ranking of the search result. 

16. `productInfo/numberOfReviews`: Number of reviews for a product (if applicable). 

17. `productInfo/price`: Price of a product (if applicable). 

18. `productInfo/rating`: Rating of a product (if applicable). 

19. `title`: Title of the search result. 

20. `type`: Type of the search result (e.g., organic). 

21. `url`: URL of the search result. 

22. `date`: Date associated with the search result. 

23. `emphasizedKeywords/3`, `emphasizedKeywords/4`, `emphasizedKeywords/5`: Additional emphasized keywords in the search result. 

24. `siteLinks/0/date`: Date associated with a site link. 

25. `siteLinks/0/description`: Description of a site link. 

26. `siteLinks/0/displayedUrl`: Displayed URL of a site link. 

27. `siteLinks/0/emphasizedKeywords/0`, `siteLinks/0/emphasizedKeywords/1`, `siteLinks/0/emphasizedKeywords/2`: Emphasized keywords in a site link. 

28. `siteLinks/0/title`: Title of a site link. 

29. `siteLinks/0/url`: URL of a site link. 

  

This data appears to contain information related to search results, including titles, descriptions, URLs, and other metadata. It can be useful for various purposes, including analyzing search engine results or extracting specific information from the web. The output provides a structured format for further analysis or processing of the scraped data. 
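Column names such as `searchQuery/term` and `siteLinks/0/title` look like nested records flattened into slash-separated paths. The sketch below shows one way such a flattening could work. It is an illustration under that assumption, not Apify's actual export logic, and the function name `flatten` and the sample record (including all of its values) are made up for the example.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict with slash-separated keys."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)  # list positions become path segments
    else:
        return {prefix: value}  # a scalar ends the path

    flat = {}
    for key, child in items:
        child_prefix = f"{prefix}/{key}" if prefix else str(key)
        flat.update(flatten(child, child_prefix))
    return flat


# Hypothetical record shaped like the columns listed above; values are made up.
record = {
    "searchQuery": {"term": "example query", "page": 1},
    "title": "Example result",
    "emphasizedKeywords": ["Python", "Learning"],
    "siteLinks": [{"title": "Example link", "url": "https://example.com"}],
}

row = flatten(record)
for column, cell in row.items():
    print(f"{column}: {cell}")
```

Each key of `row` corresponds to one CSV column, which would explain why a result with more emphasized keywords or site links produces additional numbered columns.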