# DSCI 614: Project 2

### Symphony Hopkins

## Introduction

We are part of a data science team working for the Department of Transportation. We have built a road condition dashboard, and our manager wants it to draw on more search data. To that end, we have been allowed to monitor the Google Search feeds and retrieve the latest 100 searches about the weather. We are asked to perform the following tasks.


## Step 1: Search Google using the keywords "Winter snowstorm". (Please use the sample code in the attached workbook Google Search extract info (1).ipynb.)

We will begin by importing the necessary libraries.

In [1]:
# importing libraries
from googlesearch import search
import pandas as pd

Now, we will provide the arguments for the search function. We want to use "Winter snowstorm" as the query. We also want to return 100 results, including the URL, title, and description.

In [2]:
#providing the query for the search
query = "Winter snowstorm"

#returning 100 results, including the URL, title, and description
results = list(search(query, num_results=99, sleep_interval=5, lang="en", advanced=True))

#printing results
print(results)
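
Google may throttle repeated requests with a "429 Too Many Requests" error. The following is a minimal sketch of retrying with progressively smaller result counts. The helper `search_with_fallback` and the stand-in `fake_fetch` are hypothetical names used for illustration, and the sketch assumes that a throttled request surfaces as a Python exception.

```python
def search_with_fallback(fetch, sizes=(100, 80, 50, 20, 10)):
    """Try each result count in turn until one request succeeds.

    `fetch` is a hypothetical callable that takes num_results and returns a list.
    """
    last_error = None
    for n in sizes:
        try:
            return fetch(n)
        except Exception as err:  # assumption: a 429 response raises an exception
            last_error = err
    raise RuntimeError("all result counts failed") from last_error


# Stand-in fetcher that simulates throttling for large requests (no network access).
def fake_fetch(num_results):
    if num_results > 50:
        raise RuntimeError("429 Client Error: Too Many Requests for url")
    return [f"result {i}" for i in range(num_results)]


fallback_results = search_with_fallback(fake_fetch)
print(len(fallback_results))
```

In the notebook, `fetch` could wrap the real call, for example `lambda n: list(search(query, num_results=n, sleep_interval=5, lang="en", advanced=True))`.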



## Step 2: Extract the latest 100 results, including the URL, title, and description. (If you get an error message of "429 Client Error: Too Many Requests for url:", you can try a smaller num_results such as 80, 50, 20, or 10 instead of 100.)

Before we can extract the information, we need to convert the SearchResult objects to a list of strings.

In [3]:
results = [str(item) for item in results]

Now, we can create a for loop and extract the information from the results. After extracting the URL, title, and description from each result, we will store them in a dictionary and append that dictionary to a list.

In [4]:
data =[]
for result in results:
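    #extracting the URL, title, and description using string manipulation
    #splitting on the field labels (not on bare "," or ")") so that titles and
    #descriptions containing commas or parentheses are not cut short
    url = result.split("url=", 1)[1].split(", title=", 1)[0]
    title = result.split(", title=", 1)[1].split(", description=", 1)[0]
    description = result.split(", description=", 1)[1].rsplit(")", 1)[0]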
    
    #creating a dictionary for each SearchResult
    result_dict = {
        'URL': url,
        'Title': title,
        'Description': description
    }
    
    #appending the dictionary to the data list
    data.append(result_dict)
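
String parsing depends on the exact text format of each result. If the SearchResult objects expose `url`, `title`, and `description` attributes (an assumption about the installed googlesearch version), the same dictionaries could be built from the objects directly, before they are converted to strings. The sketch below uses a stand-in class, `FakeSearchResult`, so that it runs without a network call.

```python
from dataclasses import dataclass


# Stand-in for googlesearch's SearchResult; the attribute names are an assumption.
@dataclass
class FakeSearchResult:
    url: str
    title: str
    description: str


sample_results = [
    FakeSearchResult(
        url="https://example.com/storms",
        title="Snow, ice, and wind",
        description="Blizzards (severe snowstorms) reduce visibility.",
    )
]

# Reading attributes avoids any dependence on the printed string format.
sample_data = [
    {"URL": r.url, "Title": r.title, "Description": r.description}
    for r in sample_results
]
print(sample_data[0]["Title"])
```

Because nothing is split on punctuation, commas in the title and parentheses in the description survive intact.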

## Step 3: Concatenate the URL, title, and description and obtain a new column of search_result.


Now, we will save the results in a dataframe and concatenate the URL, title, and description to obtain a new column named "search_result".

In [5]:
#creating the dataframe
df = pd.DataFrame(data)

#merging URL, title, and description columns
df['search_result'] = df['URL'] + ' ' + df['Title'] + ' ' + df['Description']

#displaying results
df.head()

| | URL | Title | Description | search_result |
|---|---|---|---|---|
| 0 | https://www.nssl.noaa.gov/education/svrwx101/w... | Severe Weather 101: Winter Weather Types | A winter storm is a combination of heavy snow,... | https://www.nssl.noaa.gov/education/svrwx101/w... |
| 1 | https://www.weather.gov/safety/winter-snow | Snow Storm Safety | Blizzard: Sustained winds or frequent gusts of... | https://www.weather.gov/safety/winter-snow Sno... |
| 2 | https://scied.ucar.edu/learning-zone/storms/wi... | Winter Storms - UCAR Center for Science Education | Snowstorms are one type of winter storm. Blizz... | https://scied.ucar.edu/learning-zone/storms/wi... |
| 3 | https://en.wikipedia.org/wiki/Winter_storm | Winter storm | A winter storm is an event in which wind coinc... | https://en.wikipedia.org/wiki/Winter_storm Win... |
| 4 | https://www.ready.gov/winter-weather | Winter Weather | Winter storms including blizzards can bring ex... | https://www.ready.gov/winter-weather Winter We... |


Since steps 4-8 focus on modifying the search_result column, we will create a copy of the column before modification so we can compare the differences at the end.

In [6]:
#creating column copy
search_result_before = df['search_result'].copy()

## Step 4: Remove the date and time in the search_result using a regular expression.

We are going to use a regular expression (regex) to remove dates and times from our new column. Dates and times can appear in many formats. To keep things simple, we will only remove these two formats:
* Date: Month D, Yr (Sept 07, 2023)
* Time: HH:MM XM (10:08 PM)

We will begin by importing the regex library.

In [7]:
#importing library
import re

Now we can remove the date and time using regex.

In [8]:
#defining regex pattern
#[A-Za-z]+ character set that matches an upper/lowercase letter (A-Z and a-z) one or more times
#\s matches a whitespace character
#\d{n} matches exactly n digits; \d{1,2} matches one or two digits
# , matches the comma
# | acts as OR operator between the date pattern and the time pattern
# : matches colon
#[APap][Mm] matches AM, PM, am, pm (and mixed case such as Am or pM)
dt_tm_pattern = r"[A-Za-z]+\s\d{1,2},\s\d{4}|\d{1,2}:\d{2} [APap][Mm]"

#creating empty list to save modified strings
mod_col = []

#removing pattern
for i in range(len(df['search_result'])):
    mod_str = re.sub(dt_tm_pattern, '', df['search_result'][i])
    mod_col.append(mod_str)
    
#replacing the column values with the modified values 
df['search_result'] = mod_col

#displaying results
df['search_result'].head()

0    https://www.nssl.noaa.gov/education/svrwx101/w...
1    https://www.weather.gov/safety/winter-snow Sno...
2    https://scied.ucar.edu/learning-zone/storms/wi...
3    https://en.wikipedia.org/wiki/Winter_storm Win...
4    https://www.ready.gov/winter-weather Winter We...
Name: search_result, dtype: object
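
To see what the pattern does and does not catch, here is a small sketch on two made-up strings (the sample text is hypothetical, not taken from the search results). The first uses the two supported formats; the second uses an ISO-style date and a 24-hour time, which this pattern leaves untouched.

```python
import re

dt_tm_pattern = r"[A-Za-z]+\s\d{1,2},\s\d{4}|\d{1,2}:\d{2} [APap][Mm]"

# Supported formats: "Month D, Yr" and "HH:MM XM" are both removed.
sample = "Updated Sept 07, 2023 at 10:08 PM: snow expected"
cleaned = re.sub(dt_tm_pattern, "", sample)
print(repr(cleaned))

# Unsupported formats: nothing matches, so the string is returned unchanged.
other = "Posted 2023-09-07 22:08"
print(repr(re.sub(dt_tm_pattern, "", other)))
```

The surrounding spaces are not part of the match, so removing a date or time leaves extra whitespace behind.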

## Step 5: Remove the hyperlink URL in the search_result using a regular expression.

Next we will remove the hyperlink URL using regex and replace it with an empty string.

In [9]:
#defining regex pattern
#https?:// matches http:// or https:// (the ? makes the "s" optional)
#\S+ matches one or more non-whitespace characters, which is the rest of the URL after the scheme
url_pattern = r'https?://\S+'

#creating empty list to save modified strings
mod_col = []

#removing pattern
for i in range(len(df['search_result'])):
    mod_str = re.sub(url_pattern, '', df['search_result'][i])
    mod_col.append(mod_str)
    
#replacing the column values with the modified values 
df['search_result'] = mod_col

#displaying results
df['search_result'].head()

0     Severe Weather 101: Winter Weather Types A wi...
1     Snow Storm Safety Blizzard: Sustained winds o...
2     Winter Storms - UCAR Center for Science Educa...
3     Winter storm A winter storm is an event in wh...
4     Winter Weather Winter storms including blizza...
Name: search_result, dtype: object
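
The loop-and-append pattern recurs in each cleaning step. As a sketch, it could be factored into one helper that applies any pattern to a list of strings. The name `remove_pattern` and the sample rows are hypothetical; the pattern used here, `https?://\S+`, matches both http and https links.

```python
import re


def remove_pattern(texts, pattern):
    """Return a new list with every match of `pattern` deleted from each string."""
    compiled = re.compile(pattern)
    return [compiled.sub("", text) for text in texts]


# Hypothetical sample rows, one with an https link and one with an http link.
rows = [
    "https://www.weather.gov/safety/winter-snow Snow Storm Safety",
    "http://example.com/old-page Winter Weather",
]
no_urls = remove_pattern(rows, r"https?://\S+")
print(no_urls)
```

On a pandas column, `df['search_result'].str.replace(pattern, '', regex=True)` performs the same substitution without an explicit loop.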

## Step 6: Remove all words containing at most two characters, such as "a", "an", "in", "on", etc.

Let's clean the column further by removing all words of at most two characters, such as "a", "an", "in", and "on".

In [10]:
#defining regex pattern
#\b word boundary anchor that ensures the match is a whole word, not part of a longer one
#[a-zA-Z] character set that matches upper/lowercase letter
#{1,2} match either one or two consecutive alphabetic characters

less_than_2_pattern = r'\b[a-zA-Z]{1,2}\b'

#creating empty list to save modified strings
mod_col = []

#removing pattern
for i in range(len(df['search_result'])):
    mod_str = re.sub(less_than_2_pattern, '', df['search_result'][i])
    mod_col.append(mod_str)
    
#replacing the column values with the modified values 
df['search_result'] = mod_col

#displaying results
df['search_result'].head()

0     Severe Weather 101: Winter Weather Types  win...
1     Snow Storm Safety Blizzard: Sustained winds  ...
2     Winter Storms - UCAR Center for Science Educa...
3     Winter storm  winter storm   event  which win...
4     Winter Weather Winter storms including blizza...
Name: search_result, dtype: object
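
A short sketch of how the word boundaries behave, using made-up sentences. Whole words of one or two letters are removed, while numbers and longer words are kept. One side effect worth knowing: an apostrophe counts as a boundary, so both halves of a contraction such as "it's" are removed.

```python
import re

short_word_pattern = r"\b[a-zA-Z]{1,2}\b"

sample = "A winter storm is an event in the US on day 12"
remaining_words = re.sub(short_word_pattern, "", sample).split()
print(remaining_words)

# "it" and "s" are each bounded by the apostrophe, so only "'" is left behind.
print(repr(re.sub(short_word_pattern, "", "it's cold")))
```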

## Step 7: Remove the following five stop words: "are", "but", "very", "since", "could" using a regular expression.

Next, we will remove the five words mentioned above.

In [11]:
#defining regex pattern
#\b word boundary anchor that ensures the match is a whole word, not part of a longer one
#(?: ... ) non-capturing group that groups list of words together
# | acts as OR operator

stop_words_pattern = r'\b(?:are|but|very|since|could)\b'

#creating empty list to save modified strings
mod_col = []

#removing pattern
for i in range(len(df['search_result'])):
    mod_str = re.sub(stop_words_pattern, '', df['search_result'][i])
    mod_col.append(mod_str)
    
#replacing the column values with the modified values 
df['search_result'] = mod_col

#displaying results
df['search_result'].head()

0     Severe Weather 101: Winter Weather Types  win...
1     Snow Storm Safety Blizzard: Sustained winds  ...
2     Winter Storms - UCAR Center for Science Educa...
3     Winter storm  winter storm   event  which win...
4     Winter Weather Winter storms including blizza...
Name: search_result, dtype: object
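
The stop-word pattern is case-sensitive, so a capitalized "But" at the start of a sentence is kept. If that is not wanted, passing `flags=re.IGNORECASE` is one option. The sketch below compares both behaviors on a made-up sentence.

```python
import re

stop_words_pattern = r"\b(?:are|but|very|since|could)\b"
sample = "But roads are very icy since Monday"

# Default matching: only the lowercase stop words are removed.
case_sensitive = re.sub(stop_words_pattern, "", sample)

# With IGNORECASE, "But" is removed as well.
case_insensitive = re.sub(stop_words_pattern, "", sample, flags=re.IGNORECASE)

print(case_sensitive.split())
print(case_insensitive.split())
```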

## Step 8: Remove all special characters and punctuation using a regular expression.

Finally, we will remove all special characters and punctuation.

In [12]:
#defining regex pattern
#^ negates the character set
#a-zA-Z0-9 matches any uppercase letter (A-Z), lowercase letter (a-z), or digit (0-9)
#\s matches white space
special_pattern = r'[^a-zA-Z0-9\s]'


#creating empty list to save modified strings
mod_col = []

#removing pattern
for i in range(len(df['search_result'])):
    mod_str = re.sub(special_pattern, '', df['search_result'][i])
    mod_col.append(mod_str)
    
#replacing the column values with the modified values 
df['search_result'] = mod_col

#displaying results
df['search_result'].head()

0     Severe Weather 101 Winter Weather Types  wint...
1     Snow Storm Safety Blizzard Sustained winds  f...
2     Winter Storms  UCAR Center for Science Educat...
3     Winter storm  winter storm   event  which win...
4     Winter Weather Winter storms including blizza...
Name: search_result, dtype: object
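
Each removal step leaves the surrounding spaces in place, which is why the output shows runs of blanks. As an optional extra step that the assignment does not require, the leftover whitespace could be collapsed afterward. The sketch below applies the special-character pattern to a made-up string and then normalizes the spacing.

```python
import re

special_pattern = r"[^a-zA-Z0-9\s]"
sample = " Winter Storms - UCAR Center: snow,  ice & wind!"

# Remove everything that is not a letter, digit, or whitespace.
no_special = re.sub(special_pattern, "", sample)

# Collapse runs of whitespace into one space and trim both ends.
collapsed = re.sub(r"\s+", " ", no_special).strip()
print(collapsed)
```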

## Summary 

Let's look at our search_result column before and after modifications.

In [13]:
#creating data for dataframe
data = {'Before': search_result_before.values, 
        'After': df['search_result'].values}

#creating and displaying dataframe
df_comparison = pd.DataFrame(data)
df_comparison

| | Before | After |
|---|---|---|
| 0 | https://www.nssl.noaa.gov/education/svrwx101/w... | Severe Weather 101 Winter Weather Types wint... |
| 1 | https://www.weather.gov/safety/winter-snow Sno... | Snow Storm Safety Blizzard Sustained winds f... |
| 2 | https://scied.ucar.edu/learning-zone/storms/wi... | Winter Storms UCAR Center for Science Educat... |
| 3 | https://en.wikipedia.org/wiki/Winter_storm Win... | Winter storm winter storm event which win... |
| 4 | https://www.ready.gov/winter-weather Winter We... | Winter Weather Winter storms including blizza... |
| ... | ... | ... |
| 94 | https://www.motortrend.com/reviews/2022-rivian... | Rivian R1T Yearlong Review Winter Snowstorms ... |
| 95 | https://thebrownandwhite.com/2018/03/07/snowst... | Lehigh closes for third winter snowstorm Th... |
| 96 | https://espo.nasa.gov/impacts/content/IMPACTS ... | IMPACTS ESPO NASA Winter snowstorms freque... |
| 97 | https://www.kansas.com/news/local/article27033... | Here how the rare winter snowstorm affected ... |
