## Steam Web Scraper
![img](https://sm.pcmag.com/t/pcmag_in/how-to/2/21-steam-t/21-steam-tips-for-pc-gaming-noobs-and-power-users_khuz.1920.jpg)

> In this project I will show how to:
>- Scrape the top 100 top sellers from [Steam](https://store.steampowered.com/)
>- Automate the web browser with Selenium to load Steam's dynamic website
>- Use the requests and BeautifulSoup libraries to scrape the website
>- Save the scraped data to a CSV file and load it back into a dataframe using pandas
>- Send the CSV file by email using Python
>- Set up a GitHub Action to run the scraper daily and email the CSV file on a specified schedule
>- Do simple web crawling using requests

### Using Selenium to access the website

In [1]:
from selenium import webdriver 
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options

### Setting up chrome options
>- You add these options before starting the web browser
- For more: https://peter.sh/experiments/chromium-command-line-switches/

In [2]:
chrome_options = Options()

# start the browser in the background; leave this commented out to watch the browser in action
# chrome_options.add_argument('--headless')

# overcome limited resource problems
chrome_options.add_argument('--disable-dev-shm-usage')

# bypass OS security model
chrome_options.add_argument('--no-sandbox')

# open the browser in maximized mode
chrome_options.add_argument("--start-maximized")

# set an explicit resolution: in headless mode the maximized window size
# may differ, which can cause 'NoSuchElementException'
chrome_options.add_argument("--window-size=1920,1080")

# to ignore security problems
chrome_options.add_argument('--ignore-certificate-errors')
chrome_options.add_argument('--allow-running-insecure-content')


>- Download the ChromeDriver release that matches your installed Chrome version; it should be in the same directory as the notebook or on your system PATH
- URL: https://chromedriver.chromium.org/downloads

In [3]:
url = 'https://store.steampowered.com/'
driver = webdriver.Chrome(options=chrome_options)
driver.maximize_window()
driver.get(url)


In [4]:
# in headless mode, this saves the browser window as an image file so you can inspect it
driver.get_screenshot_as_file("screenshot.png")


True

### Finding and Interacting with Elements using Selenium

In [5]:
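from selenium.webdriver.common.by import By

# select and click the top sellers option
# (find_element_by_xpath was removed in Selenium 4.3; use find_element with By.XPATH)
driver.find_element(
    By.XPATH,
    '/html/body/div[1]/div[7]/div[5]/div[1]/div[1]/div/div[1]/div[8]/a[1]').click()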

# get html document
html_text = driver.page_source


In [6]:
from selenium.webdriver.common.by import By


def start_driver():
    # initialize the Chrome driver
    print('Starting chrome web driver')

    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument("--start-maximized")
    chrome_options.add_argument("--window-size=1920,1080")
    chrome_options.add_argument('--ignore-certificate-errors')
    chrome_options.add_argument('--allow-running-insecure-content')

    driver = webdriver.Chrome(options=chrome_options)

    driver.get('https://store.steampowered.com/')  # open the Steam store

    driver.maximize_window()  # maximize so the top sellers option is visible

    print(driver.title)

    # select and click the top sellers option
    # (find_element_by_xpath was removed in Selenium 4.3)
    driver.find_element(
        By.XPATH,
        '/html/body/div[1]/div[7]/div[5]/div[1]/div[1]/div/div[1]/div[8]/a[1]').click()

    scroll_page(driver)
    html_text = driver.page_source  # get scrolled page contents

    return html_text


### Parsing Info using BeautifulSoup
>- We have the HTML document; to extract information from it, first convert the HTML text to a BeautifulSoup object, then use the find or find_all method to get the required information
>- find() returns the first occurrence of the specified element, or None if it is not found
>- find_all() returns a list of all matching descendant elements, or an empty list if none are found
>- Attributes can also be used to find elements

Refer : https://www.crummy.com/software/BeautifulSoup/bs4/doc/
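
The difference between the two return values matters most when an element is missing. Here is a minimal sketch on a made-up HTML snippet; the markup, ids and class names are illustrative assumptions, not Steam's real page:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only; it is not Steam's real HTML.
html = """
<div id="results">
  <a class="row" href="/app/1"><span class="title">Game One</span></a>
  <a class="row" href="/app/2"><span class="title">Game Two</span></a>
</div>
"""

doc = BeautifulSoup(html, 'html.parser')

# find() returns the first match, or None when nothing matches
first_row = doc.find('a', class_='row')
missing = doc.find('div', class_='does-not-exist')

# find_all() returns a list-like result, which is empty when nothing matches
rows = doc.find_all('a', class_='row')
no_rows = doc.find_all('a', class_='does-not-exist')

# Attributes can be used for lookups and read back like dictionary keys
container = doc.find('div', attrs={'id': 'results'})
titles = [a.find('span', class_='title').text for a in container.find_all('a')]
hrefs = [a['href'] for a in rows]

print(first_row['href'], missing, len(rows), len(no_rows))
print(titles, hrefs)
```

Because find() can return None, chaining `.text` directly onto it raises an AttributeError when the element is absent, which is why the helper functions later wrap their parsing in try/except.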

In [7]:
from bs4 import BeautifulSoup

In [8]:
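# name the parser explicitly to avoid the "no parser was explicitly specified" warning
doc = BeautifulSoup(html_text, 'html.parser')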
titles_a_tags = doc.find('div',attrs={'id':'search_resultsRows'}).find_all('a')

In [9]:
len(titles_a_tags)

50

>- The page lists over 25k titles, but we got only 50 because more titles load only as we scroll down; scroll further to reveal as many of them as you want

In [10]:
import time
SCROLL_PAUSE_TIME = 0.5

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

# scroll for at most 20 seconds
scroll_time = 20
time_end = time.time() + scroll_time
while time.time() < time_end:
    # Scroll down to bottom
    driver.execute_script(
        "window.scrollTo(0, document.body.scrollHeight);")

    # Wait for the page to load
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height


In [11]:
html_text = driver.page_source
doc = BeautifulSoup(html_text, 'html.parser')
titles_a_tags = doc.find('div',attrs={'id':'search_resultsRows'}).find_all('a')
len(titles_a_tags)

1200

>- Let's also keep converting the code we wrote into helper functions with exception handling, so that all the titles can be processed easily

In [12]:
def scroll_page(driver):
    print('Scrolling webpage')
    SCROLL_PAUSE_TIME = 0.5

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    # scroll for at most 20 seconds
    time_end = time.time() + 20
    while time.time() < time_end:
        # Scroll down to bottom
        driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height


>- We were able to grab 1k+ titles after scrolling for 20 seconds; let's work out the parsing on one title, then do the same for the others

In [13]:
titles_a_tags[0]

<a class="search_result_row ds_collapse_flag app_impression_tracked" data-ds-appid="1245620" data-ds-crtrids="[33042543]" data-ds-descids="[2,5]" data-ds-itemkey="App_1245620" data-ds-steam-deck-compat-handled="true" data-ds-tagids="[29482,4604,122,4026,4231,1697,1654]" data-gpnav="item" href="https://store.steampowered.com/app/1245620/ELDEN_RING/?snr=1_7_7_7000_150_1" onmouseout="HideGameHover( this, event, 'global_hover' )" onmouseover="GameHover( this, event, 'global_hover', {&quot;type&quot;:&quot;app&quot;,&quot;id&quot;:1245620,&quot;public&quot;:1,&quot;v6&quot;:1} );">
<div class="col search_capsule"><img src="https://cdn.akamai.steamstatic.com/steam/apps/1245620/capsule_sm_120.jpg?t=1645600407" srcset="https://cdn.akamai.steamstatic.com/steam/apps/1245620/capsule_sm_120.jpg?t=1645600407 1x, https://cdn.akamai.steamstatic.com/steam/apps/1245620/capsule_231x87.jpg?t=1645600407 2x"/></div>
<div class="responsive_search_name_combined">
<div class="col search_name ellipsis">
<span 

>- Using the browser's inspect option, we can find attributes such as class and id through which we can locate the element

In [14]:
# use .text to grab only the text between the tags, and strip() to remove surrounding whitespace (tabs, newlines, spaces, etc.)
title_name = titles_a_tags[0].find('span', class_='title').text.strip() 
title_name

'ELDEN RING'

In [15]:
title_date = titles_a_tags[0].find(
    'div', class_='col search_released responsive_secondrow').text.strip().split(',')
title_date = title_date[0] + title_date[-1]
title_date


'25 Feb 2022'
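
If you want to sort or filter by release date later, the string can be turned into a real date object. This is a minimal sketch that assumes the 'day month-abbreviation year' shape shown above and an English locale; `parse_release_date` is a hypothetical helper, not part of the scraper:

```python
from datetime import datetime


def parse_release_date(text):
    """Convert a string such as '25 Feb 2022' to a date, or None if it does not fit."""
    try:
        # %d = day of month, %b = abbreviated month name, %Y = four-digit year
        return datetime.strptime(text.strip(), '%d %b %Y').date()
    except ValueError:
        # Anything in another shape (assumed possible, e.g. a placeholder text)
        return None


print(parse_release_date('25 Feb 2022'))
print(parse_release_date('not a date'))
```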

In [16]:
title_price = titles_a_tags[0].find(
    'div', class_='col search_price responsive_secondrow')
if not title_price:
    # fall back to the discounted price element
    title_price = titles_a_tags[0].find(
        'div', class_='col search_price discounted responsive_secondrow')
# title_price = title_price.text.split('₹')[-1].strip()
title_price = title_price.text.strip()
# keep only letters and digits; this drops the currency symbol and
# separators (including any decimal point)
price = [i for i in title_price if i.isalnum()]
title_price = float(''.join(price))
title_price

2499.0
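
Filtering characters with isalnum() works for whole-number prices like the one above, but it would also drop a decimal point and fail on text without digits. A regex-based alternative could look like the sketch below. `parse_price` is a hypothetical helper, and it assumes commas are thousands separators and a dot is the decimal mark, which may not hold for every store region:

```python
import re


def parse_price(text):
    """Return the last price found in `text` as a float, or None if there is none."""
    # Digit groups with optional thousands separators and an optional decimal part
    matches = re.findall(r'\d[\d,]*(?:\.\d+)?', text)
    if not matches:
        return None
    # Take the last number: assumed here to be the current price when an
    # original and a discounted price appear together
    return float(matches[-1].replace(',', ''))


print(parse_price('₹ 2,499'))
print(parse_price('₹ 2,999₹ 1,499'))
print(parse_price('Free to Play'))
```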

In [17]:
print('Title Name:',title_name)
print('Title price:',title_price)
print('Title date:',title_date)

Title Name: ELDEN RING
Title price: 2499.0
Title date: 25 Feb 2022


In [18]:
def parse_game_titles(title):
    '''
    Function to parse an <a> tag to get the name, release date and price
    '''
    try:
        title_name = title.find('span', class_='title').text.strip()

        print('Parsing Game title: {}'.format(title_name))

        title_date = title.find(
            'div', class_='col search_released responsive_secondrow').text.strip().split(',')
        title_date = title_date[0] + title_date[-1]

        # find price without discount
        title_price = title.find(
            'div', class_='col search_price responsive_secondrow')
        if not title_price:
            # fall back to the discounted price element
            title_price = title.find(
                'div', class_='col search_price discounted responsive_secondrow')
        # take the last whitespace-separated token of the price text
        title_price = title_price.text.split()[-1]
        price = [i for i in title_price if i.isalnum()]
        title_price = float(''.join(price))

    except Exception:
        # if any field fails to parse, the whole record is marked 'NA'
        title_name, title_date, title_price = 'NA', 'NA', 'NA'

    return {
        'title_name': '"' + str(title_name) + '"',
        'realease_date': '"' + str(title_date) + '"',
        'price': '"' + str(title_price) + '"'
    }


In [19]:
parse_game_titles(titles_a_tags[32])


Parsing Game title: God of War


{'title_name': '"God of War"',
 'realease_date': '"14 Jan 2022"',
 'price': '"3299.0"'}

>- Great, we got the details from the titles page; let's try to grab more details for individual titles
>- One approach is to take each title's 'href' attribute, which is simply a link to the title's own page, and use requests to fetch it
>- Once we are on the title's page, we can inspect it and grab the info we need: the description, developer, publisher, review, rating and tags for that game title
>- This process of going from one page to another is called *crawling*; there are dedicated tools such as spiders for it, but we will do simple crawling with requests

In [20]:
# to grab an attribute's value, use ['attribute_name'] with the attribute name as the key
href = titles_a_tags[0]['href']

### Using requests to get the page

In [21]:
import requests

In [22]:
response = requests.get(href)
print(response.status_code) #checking status code


200


>- With requests we can make any HTTP method call, such as GET, POST, PUT or PATCH
>- Requests can also be used to call APIs, pass API keys and handle other kinds of authentication
>- Each response carries a status code stating whether the call was successful (codes starting with 2) or failed (codes starting with 4 for client errors or 5 for server errors)

- Status Codes
>- Informational responses (100–199)
>- Successful responses (200–299)
>- Redirection messages (300–399)
>- Client error responses (400–499)
>- Server error responses (500–599)

Refer : https://developer.mozilla.org/en-US/docs/Web/HTTP/Status 
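
The ranges above can be checked in code by looking at the first digit of the status code. A small sketch using only the standard library; `status_class` is a hypothetical helper written for this illustration:

```python
from http import HTTPStatus


def status_class(code):
    """Map an HTTP status code to the name of its class."""
    classes = {
        1: 'Informational',
        2: 'Successful',
        3: 'Redirection',
        4: 'Client error',
        5: 'Server error',
    }
    # Integer division by 100 keeps only the leading digit of a 3-digit code
    return classes.get(code // 100, 'Unknown')


for code in (200, 301, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase for that code
    print(code, status_class(code), HTTPStatus(code).phrase)
```

With requests itself, `response.ok` is a shortcut for "status code below 400", and `response.raise_for_status()` raises an exception for 4xx and 5xx responses.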

In [23]:
# response.text holds the HTML document
doc = BeautifulSoup(response.text, 'html.parser')

In [24]:
title_descp = doc.find(
    'div', class_='game_description_snippet').text.strip()
title_descp


'THE NEW FANTASY ACTION RPG. Rise, Tarnished, and be guided by grace to brandish the power of the Elden Ring and become an Elden Lord in the Lands Between.'

In [25]:
title_review = doc.find('div', 'summary column')
title_review


<div class="summary column">
																					No user reviews																				</div>

In [26]:
title_review = list(map(lambda x: x.strip(), title_review.text.split('\n')))
title_review


['', 'No user reviews']

In [27]:
title_review = [tag for tag in title_review if tag]
title_review


['No user reviews']

In [28]:
if len(title_review) != 1:
    title_review, title_rating = title_review[:2]
else:
    title_rating = 0
title_rating


0
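
For titles that do have reviews, title_rating ends up as a bracketed string such as '(14,830)', as seen in an output further down. If a number is more useful than the string, a sketch like this could convert it; `parse_review_count` is a hypothetical helper, not part of the scraper:

```python
import re


def parse_review_count(text):
    """Turn a value such as '(14,830)' into the integer 14830; return 0 if it has no digits."""
    # Remove every non-digit character (brackets, commas, spaces)
    digits = re.sub(r'\D', '', str(text))
    return int(digits) if digits else 0


print(parse_review_count('(14,830)'))
print(parse_review_count(0))
```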

In [29]:
developer = doc.find('div', class_='dev_row')
developer


<div class="dev_row">
<div class="subtitle column">Developer:</div>
<div class="summary column" id="developers_list">
<a href="https://store.steampowered.com/developer/BANDAINAMCO?snr=1_5_9__2000">FromSoftware Inc.</a> </div>
</div>

In [30]:
developer = developer.text.strip().split()[1]
developer


'FromSoftware'

In [31]:
publisher = doc.find('div', class_='dev_row').find_next_sibling()
publisher


<div class="dev_row">
<div class="subtitle column">Publisher:</div>
<div class="summary column">
<a href="https://store.steampowered.com/publisher/BANDAINAMCO?snr=1_5_9__2000">FromSoftware Inc.</a>, <a href="https://store.steampowered.com/publisher/BANDAINAMCO?snr=1_5_9__2000">BANDAI NAMCO Entertainment</a> </div>
</div>

In [32]:
publisher = publisher.text.strip().split()[1:]


In [33]:
publisher = ' '.join(publisher)
publisher


'FromSoftware Inc., BANDAI NAMCO Entertainment'
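
Splitting the row text on whitespace works, but it depends on the 'Developer:' / 'Publisher:' label being exactly one word. Another option is to read the names from the `<a>` tags inside each row. The sketch below runs on markup trimmed from the outputs above (hrefs shortened to '#'); `names_in_row` is a hypothetical helper:

```python
from bs4 import BeautifulSoup

# Markup trimmed from the outputs shown above; the hrefs are shortened
html = """
<div class="dev_row">
<div class="subtitle column">Developer:</div>
<div class="summary column" id="developers_list">
<a href="#">FromSoftware Inc.</a> </div>
</div>
<div class="dev_row">
<div class="subtitle column">Publisher:</div>
<div class="summary column">
<a href="#">FromSoftware Inc.</a>, <a href="#">BANDAI NAMCO Entertainment</a> </div>
</div>
"""


def names_in_row(row):
    """Join the link texts inside one dev_row; the label div has no link, so it is skipped."""
    return ', '.join(a.text.strip() for a in row.find_all('a'))


doc = BeautifulSoup(html, 'html.parser')
dev_row, pub_row = doc.find_all('div', class_='dev_row')
developer = names_in_row(dev_row)
publisher = names_in_row(pub_row)
print(developer)
print(publisher)
```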

In [34]:
title_tags = doc.find('div', class_='glance_tags popular_tags')
title_tags


<div class="glance_tags popular_tags" data-appid="1245620" data-panel='{"flow-children":"row"}'>
<a class="app_tag" href="https://store.steampowered.com/tags/en/Souls-like/?snr=1_5_9__409" style="display: none;">
												Souls-like												</a><a class="app_tag" href="https://store.steampowered.com/tags/en/Dark%20Fantasy/?snr=1_5_9__409" style="display: none;">
												Dark Fantasy												</a><a class="app_tag" href="https://store.steampowered.com/tags/en/RPG/?snr=1_5_9__409" style="display: none;">
												RPG												</a><a class="app_tag" href="https://store.steampowered.com/tags/en/Difficult/?snr=1_5_9__409" style="display: none;">
												Difficult												</a><a class="app_tag" href="https://store.steampowered.com/tags/en/Action%20RPG/?snr=1_5_9__409" style="display: none;">
												Action RPG												</a><a class="app_tag" href="https://store.steampowered.com/tags/en/Third%20Person/?snr=1_5_9__409" style="display: none;">
												Third Per

In [35]:
title_tags  = list(map(lambda x: x.strip(), title_tags.text.split('\t')))
title_tags


['',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Souls-like',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Dark Fantasy',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'RPG',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Difficult',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Action RPG',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Third Person',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Relaxing',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 'Action',
 '',
 '',
 '',
 '',
 '',


In [36]:
title_tags = [tag for tag in title_tags if tag]
title_tags


['Souls-like',
 'Dark Fantasy',
 'RPG',
 'Difficult',
 'Action RPG',
 'Third Person',
 'Relaxing',
 'Action',
 'Fantasy',
 'Online Co-Op',
 'Multiplayer',
 'Singleplayer',
 'Co-op',
 'PvP',
 'Violent',
 '3D',
 'Open World',
 'Great Soundtrack',
 'Atmospheric',
 'Walking Simulator',
 '+']

In [37]:
title_tags = ','.join(title_tags)
title_tags


'Souls-like,Dark Fantasy,RPG,Difficult,Action RPG,Third Person,Relaxing,Action,Fantasy,Online Co-Op,Multiplayer,Singleplayer,Co-op,PvP,Violent,3D,Open World,Great Soundtrack,Atmospheric,Walking Simulator,+'
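
Note the trailing '+' in the result: it appears in the tag container's text but is presumably a button label rather than a real tag. A sketch that splits on lines instead of tabs and drops that entry is shown below; `clean_tags` is a hypothetical helper and the sample string only imitates the whitespace seen in the output above:

```python
def clean_tags(raw_text):
    """Split whitespace-padded tag text into a clean list, dropping the '+' entry."""
    # Each tag sits on its own line, padded with tabs on both sides
    tags = [line.strip() for line in raw_text.splitlines()]
    # Discard empty lines and the '+' entry (assumed not to be a real tag)
    return [tag for tag in tags if tag and tag != '+']


raw = "\n\t\t\tSouls-like\t\t\t\n\t\t\tDark Fantasy\t\t\t\n\t\t\tRPG\t\t\t\n+\n"
print(clean_tags(raw))
```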

In [38]:
print('Title\'s Description:',title_descp)
print('Title\'s Review:',title_review)
print('Title\'s Rating:',title_rating)
print('Title\'s Developer:',developer)
print('Title\'s Publisher:',publisher)
print('Title\'s Tags:',title_tags)

Title's Description: THE NEW FANTASY ACTION RPG. Rise, Tarnished, and be guided by grace to brandish the power of the Elden Ring and become an Elden Lord in the Lands Between.
Title's Review: ['No user reviews']
Title's Rating: 0
Title's Developer: FromSoftware
Title's Publisher: FromSoftware Inc., BANDAI NAMCO Entertainment
Title's Tags: Souls-like,Dark Fantasy,RPG,Difficult,Action RPG,Third Person,Relaxing,Action,Fantasy,Online Co-Op,Multiplayer,Singleplayer,Co-op,PvP,Violent,3D,Open World,Great Soundtrack,Atmospheric,Walking Simulator,+


In [39]:
def clean_review_tags(tag_element, char):  # to parse reviews and game tags
    tag_element = list(map(lambda x: x.strip(), tag_element.text.split(char)))
    tag_element = [tag for tag in tag_element if tag]
    return tag_element


def parse_game_href(href):

    doc = get_page(href)
    print('Parsing Game titles\'s \'href:\'{} '.format(doc.title.text))

    try:
        title_descp = doc.find(
            'div', class_='game_description_snippet')

        if title_descp:    
            title_descp = title_descp.text.strip()

        title_review = doc.find('div', 'summary column')
        title_review = clean_review_tags(title_review, '\n')

        if len(title_review) != 1:
            title_review, title_rating = title_review[:2]
        else:
            title_rating = 0

        developer = doc.find('div', class_='dev_row')
        developer = developer.text.strip().split()[1:]
        developer = ' '.join(developer)


        publisher = doc.find('div', class_='dev_row').find_next_sibling()
        publisher = publisher.text.strip().split()[1:]
        publisher = ' '.join(publisher)

        title_tags = doc.find('div', class_='glance_tags popular_tags')
        title_tags = clean_review_tags(title_tags, '\t')
        title_tags = ','.join(title_tags)

    except Exception as a:
        title_descp, title_review, title_rating, developer, publisher, title_tags = [
            'NA']*6
        print(a)

    return {
        'description': '"' + str(title_descp) + '"',
        'title_review': '"' + str(title_review) + '"',
        'title_rating': '"' + str(title_rating) + '"',
        'developer': '"' + str(developer) + '"',
        'publisher': '"' + str(publisher) + '"',
        'game_tags': '"' + str(title_tags) + '"'
    }


In [40]:
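def get_page(url):
    response = requests.get(url)
    if not response.ok:
        # fail with a clear error instead of returning an undefined `doc`
        raise RuntimeError('Page Loading Failed! ({}) {}'.format(
            response.status_code, url))

    return BeautifulSoup(response.text, features='html.parser')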


In [41]:
parse_game_href(titles_a_tags[45]['href'])

Parsing Game titles's 'href:'Rust on Steam 


{'description': '"The only aim in Rust is to survive. Everything wants you to die - the island’s wildlife and other inhabitants, the environment, other survivors. Do whatever it takes to last another night."',
 'title_review': '"Very Positive"',
 'title_rating': '"(14,830)"',
 'developer': '"Facepunch Studios"',
 'publisher': '"Facepunch Studios"',
 'game_tags': '"Survival,Crafting,Multiplayer,Open World,Open World Survival Craft,Building,Sandbox,PvP,Adventure,First-Person,Action,FPS,Nudity,Co-op,Shooter,Online Co-Op,Indie,Early Access,Post-apocalyptic,Simulation,+"'}

In [42]:
parsed_titles = [parse_game_titles(title) for title in titles_a_tags[:10]]
parsed_game_href = [parse_game_href(tag['href']) for tag in titles_a_tags[:10]]

# merge the two dicts into one; the '|' operator requires Python 3.9 or later
game_info = [title | title_info for title, title_info in zip(
    parsed_titles, parsed_game_href)]


Parsing Game title: ELDEN RING
Parsing Game title: ELDEN RING
Parsing Game title: Destiny 2: The Witch Queen Deluxe Edition
Parsing Game title: Destiny 2: The Witch Queen
Parsing Game title: Destiny 2: The Witch Queen Deluxe + Bungie 30th Anniversary Bundle
Parsing Game title: Total War: WARHAMMER III
Parsing Game title: Cyberpunk 2077
Parsing Game title: Dying Light 2 Stay Human
Parsing Game title: Dread Hunger
Parsing Game title: NARAKA: BLADEPOINT
Parsing Game titles's 'href:'Pre-purchase ELDEN RING on Steam 
Parsing Game titles's 'href:'Pre-purchase ELDEN RING on Steam 
Parsing Game titles's 'href:'Destiny 2: The Witch Queen Deluxe Edition on Steam 
Parsing Game titles's 'href:'Destiny 2: The Witch Queen on Steam 
Parsing Game titles's 'href:'Destiny 2: The Witch Queen Deluxe + Bungie 30th Anniversary Bundle on Steam 
Parsing Game titles's 'href:'Total War: WARHAMMER III on Steam 
Parsing Game titles's 'href:'Save 50% on Cyberpunk 2077 on Steam 
Parsing Game titles's 'href:'Dying L
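
The dict merge in the cell above can be written in two equivalent ways, depending on the Python version available. A small sketch with made-up values (the second price is invented purely to show which side wins on a duplicate key):

```python
# Made-up sample records for illustration
title = {'title_name': 'ELDEN RING', 'price': 2499.0}
details = {'developer': 'FromSoftware Inc.', 'price': 2399.0}

# Python 3.9+: the | operator builds a new dict; on duplicate keys the right side wins
merged_new = title | details

# Unpacking form with the same result, usable on older Python 3 versions too
merged_old = {**title, **details}

print(merged_new)
print(merged_new == merged_old)
```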

In [43]:
game_info[6:8]

[{'title_name': '"Cyberpunk 2077"',
  'realease_date': '"10 Dec 2020"',
  'price': '"1499.0"',
  'description': '"Cyberpunk 2077 is an open-world, action-adventure RPG set in the dark future of Night City — a dangerous megalopolis obsessed with power, glamor, and ceaseless body modification."',
  'title_review': '"Mostly Positive"',
  'title_rating': '"(8,715)"',
  'developer': '"CD PROJEKT RED"',
  'publisher': '"CD PROJEKT RED"',
  'game_tags': '"Cyberpunk,Open World,RPG,Sci-fi,Futuristic,Singleplayer,Nudity,FPS,First-Person,Atmospheric,Story Rich,Exploration,Mature,Action,Violent,Great Soundtrack,Action RPG,Adventure,Character Customization,Immersive Sim,+"'},
 {'title_name': '"Dying Light 2 Stay Human"',
  'realease_date': '"4 Feb 2022"',
  'price': '"2999.0"',
  'description': '"The virus won and civilization has fallen back to the Dark Ages. The City, one of the last human settlements, is on the brink of collapse. Use your agility and combat skills to survive, and reshape the wor

### Writing Data to csv file 
>- Now we can write this data into a csv file

In [44]:
with open('./steam_data_tutorial.csv', 'w', encoding="utf-8") as f:

    headers = list(game_info[0].keys())
    f.write(','.join(headers)+'\n')

    for item in game_info:
        values = []
        #you can iterate through item too
        for header in headers:
            values.append(str(item.get(header, '')))
        f.write(','.join(values)+'\n')

In [45]:
import os

In [46]:
os.listdir('.')

['.git',
 '.github',
 'chromedriver',
 'chromedriver.exe',
 'README.md',
 'requirements.txt',
 'scrapper.py',
 'steam_data_tutorial.csv',
 'Steam_scrapper_Tutorial.ipynb']

In [47]:
def write_to_csv(info, path='./steam_data.csv'):
    # check before opening, so an empty result does not leave an empty file behind
    if not info:
        print('Nothing to write!')
        return

    with open(path, 'w', encoding="utf-8") as f:
        headers = list(info[0].keys())
        f.write(','.join(headers)+'\n')

        for item in info:
            values = []
            for header in headers:
                values.append(str(item.get(header, '')))
            f.write(','.join(values)+'\n')
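
Writing the rows by hand works here because every value was wrapped in double quotes by the parse functions. If the values were kept raw instead, the standard csv module could take care of quoting, including values that themselves contain quotes or commas. A minimal sketch under that assumption; `rows_to_csv` is a hypothetical helper and the sample row is made up:

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text; fields are quoted only where needed."""
    if not rows:
        return ''
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up row whose title contains both a comma and double quotes
sample = [{'title_name': 'Say "Hi", World', 'price': 10.0}]
text = rows_to_csv(sample)
print(text)

# Round trip: DictReader restores the original string, commas and quotes included
restored = list(csv.DictReader(io.StringIO(text)))
print(restored[0]['title_name'])
```

When writing to a real file instead of a StringIO buffer, the csv documentation recommends opening it with `newline=''`.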


### Loading the csv file as dataframe using pandas

In [48]:
import pandas as pd

In [49]:
steam_df = pd.read_csv('./steam_data_tutorial.csv')


In [50]:
steam_df.sample(5)

| | title_name | realease_date | price | description | title_review | title_rating | developer | publisher | game_tags |
|---|---|---|---|---|---|---|---|---|---|
| 5 | Total War: WARHAMMER III | 17 Feb 2022 | 3399.0 | The cataclysmic conclusion to the Total War: W... | Mixed | (18,214) | CREATIVE ASSEMBLY | SEGA, Feral Interactive | Strategy,Grand Strategy,Turn-Based Strategy,RT... |
| 3 | Destiny 2: The Witch Queen | 22 Feb 2022 | 899.0 | | Very Positive | (249) | Bungie | Bungie | Action,Adventure,Free to Play,Looter Shooter,F... |
| 9 | NARAKA: BLADEPOINT | 12 Aug 2021 | 839.0 | BRUCE LEE and his iconic Nunchucks now have jo... | Mostly Positive | (3,811) | 24 Entertainment | NetEase Games Montréal | Battle Royale,Multiplayer,Sexual Content,Marti... |
| 7 | Dying Light 2 Stay Human | 4 Feb 2022 | 2999.0 | The virus won and civilization has fallen back... | Very Positive | (56,600) | Techland | Techland | Open World,Zombies,Parkour,Co-op,Multiplayer,S... |
| 0 | ELDEN RING | 25 Feb 2022 | 2499.0 | THE NEW FANTASY ACTION RPG. Rise, Tarnished, a... | ['No user reviews'] | 0 | FromSoftware Inc. | FromSoftware Inc., BANDAI NAMCO Entertainment | Souls-like,Dark Fantasy,RPG,Difficult,Action R... |


### Sending the csv file to email 
- Now that the CSV is ready, we can send it by email using SMTP
- Create a dummy Google account with [less secure access enabled](https://myaccount.google.com/lesssecureapps) from which the email is sent; the receiving address can be any email of your choice. Google has since phased out this setting for most accounts, so an app password may be needed instead

Refer : 
- https://stackoverflow.com/questions/23171140/how-do-i-send-an-email-with-a-csv-attachment-using-python
- https://docs.python.org/3/library/email.mime.html
- https://www.courier.com/blog/three-ways-to-send-emails-using-python-with-code-tutorials/

In [51]:
from email.mime.text import MIMEText
import mimetypes
from email.mime.multipart import MIMEMultipart
import smtplib


def send_email():
    print('Sending email..')

    emailfrom = 'from@gmail.com'
    emailto = "to@gmail.com"
    fileToSend = "steam_data_tutorial.csv"
    username = "from email username"
    password = "from email password"

    msg = MIMEMultipart()  # build a multipart message (email)
    msg["From"] = emailfrom
    msg["To"] = emailto
    msg["Subject"] = "Top 100 Topsellers from Steam"

    ctype, encoding = mimetypes.guess_type(fileToSend)
    if ctype is None or encoding is not None:
        ctype = "application/octet-stream"

    maintype, subtype = ctype.split("/", 1)

    # read the file with the same encoding it was written with
    with open(fileToSend, encoding="utf-8") as fp:
        # MIMEText sets a transfer encoding itself, so the extra
        # encoders.encode_base64() call is not needed
        attachment = MIMEText(fp.read(), _subtype=subtype, _charset="utf-8")

    attachment.add_header("Content-Disposition",
                          "attachment", filename=fileToSend)
    msg.attach(attachment)

    try:
        server = smtplib.SMTP_SSL('smtp.gmail.com', 465)
        server.login(username, password)
        server.sendmail(emailfrom, emailto, msg.as_string())
        server.quit()
        print('Email Sent!')
    except Exception:
        print('Sending Email Failed!')


In [52]:
send_email()

Sending email..
Sending Email Failed!
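
The standard library also offers the newer `email.message.EmailMessage` class, which builds the multipart structure and picks the encodings for you. The sketch below only builds the message and inspects it; nothing is sent. `build_message`, the addresses and the CSV text are placeholders made up for this illustration:

```python
from email.message import EmailMessage


def build_message(sender, recipient, csv_text, filename='steam_data.csv'):
    """Build an email with the CSV text attached; nothing is sent here."""
    msg = EmailMessage()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = 'Top 100 Topsellers from Steam'
    msg.set_content('The scraped data is attached.')
    # For str input, add_attachment creates a text/<subtype> part and
    # chooses a suitable transfer encoding on its own
    msg.add_attachment(csv_text, subtype='csv', filename=filename)
    return msg


# Placeholder addresses and data
message = build_message('from@gmail.com', 'to@gmail.com',
                        'title_name,price\nELDEN RING,2499.0\n')
attachment = next(message.iter_attachments())
print(message.get_content_type())
print(attachment.get_content_type(), attachment.get_filename())
```

A message built this way can be handed to `send_message()` on an `smtplib.SMTP_SSL` connection after logging in, in place of the `sendmail(..., msg.as_string())` call used above.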


>### Setting up a GitHub Action
>
- Once all the functions are written, put them in a separate '.py' file, as I did in 'scrapper.py'
- We are now ready to push the code to GitHub

> Follow these steps to push the code to GitHub
- Create an empty repository on GitHub with a name of your choice, say 'scrapper'
- Make sure git is set up on your device (https://www.atlassian.com/git/tutorials/install-git)
- Then initialize an empty repo locally using your terminal/command prompt


  >Run these commands on your local device
  - Navigate to the directory you wish to initialize with git using `cd directorypath`
  - Then `git init`
  - `git add .` (to track all files)
  - `git commit -m "short msg title" -m "short msg description"` (to commit files)
  - Copy the SSH/HTTPS link from the repo you created on GitHub
  - Then `git remote add remote_name ssh_or_https_link_here`
  - `git remote -v` (to check that the remote was added)
  - `git push remote_name branch_name` (finally, to push all the code to GitHub)

>- Creating a GitHub Action
- Once all this is done, congrats, you have just added version control to your code
- Now, to set up a GitHub Action, go to the Actions tab, then click "set up a workflow yourself" to create a new workflow
- Configure the .yml file to set up your own workflow like this


In workflow_name.yml file
```
# name of the action
name: action_name

# events that trigger the workflow, such as a push, a new PR or a scheduled time
on:

  # scheduled time to run, in cron format
  schedule:
    - cron: "0 0 * * *" # runs at 00:00 UTC every day

  # allows you to run this workflow manually from the Actions tab
  workflow_dispatch:

  # triggers the workflow on push events, but only for the main branch
  push:
    branches: [main]

# environment variables; keeps sensitive data such as API keys and passwords hidden
env:
  # if file.py requires passwords etc., store them as secrets
  key: ${{ secrets.key }}

# jobs to perform
jobs:

  build:
    # OS the job runs on
    runs-on: ubuntu-latest

    # steps run in order
    steps:

      # step name
      - name: checkout repo content
        uses: actions/checkout@v2 # check out the repository content to the GitHub runner

      - name: setup python
        uses: actions/setup-python@v2
        with:
          python-version: "3.10.0" # install the Python version needed

      - name: execute py script # run file.py to get the latest data
        # commands to execute in the terminal
        run: |
          google-chrome --version
          chromedriver --version
          pip install -r requirements.txt
          python file.py
```
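
Values listed under `env:` in the workflow reach the script as ordinary environment variables. Here is a minimal sketch of how file.py could read one; `get_secret` is a hypothetical helper, and the name 'key' simply mirrors the placeholder used in the workflow above:

```python
import os


def get_secret(name):
    """Read a value passed in through the workflow's env block, failing early if it is absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError('Environment variable {!r} is not set'.format(name))
    return value


# Simulate what the runner would provide; in the workflow the value comes from secrets.key
os.environ['key'] = 'dummy-value'
print(get_secret('key'))
```

Reading credentials this way keeps the email password out of the repository.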

- After doing this, go to the Actions tab, click action_name, then click `Run workflow` to trigger the workflow manually
- Whether the build succeeds or fails, check the logs to debug any issue you may face
- The workflow will also run automatically at the scheduled time; you can check the logs after it has executed

## Wrapping Up!
>- I learned a lot while doing this project; here is a quick summary of the work we did
- Used Selenium to run the browser and open the Steam website
- Navigated to the top sellers page and scraped game titles for info such as name, price and release date using BeautifulSoup
- Used requests to get the page for each title, then scraped info such as review, rating, description, tags, publisher and developer of the game
- Saved all of this data to a CSV file
- Loaded it back using pandas
- Sent the CSV file using SMTP
- Added version control using git
- Made a GitHub Action to automate the whole workflow, from opening the browser and scraping the info to saving the CSV and emailing it at a scheduled time


>- Future work
- Do an EDA on the built dataset
- Use multithreading to make the script run faster
- Add Docker support
- Automate using AWS Lambda
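
As a rough idea of what the multithreading item could look like, the sketch below runs a stand-in function over several links with a thread pool. `fake_parse` only sleeps and is a placeholder for parse_game_href; the example.com URLs are made up:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_parse(href):
    """Stand-in for parse_game_href: sleeps briefly instead of downloading a page."""
    time.sleep(0.05)
    return {'href': href, 'parsed': True}


# Made-up links for illustration
hrefs = ['https://example.com/app/{}'.format(i) for i in range(10)]

# map() returns results in input order, so they line up with hrefs
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fake_parse, hrefs))

print(len(results))
```

Threads suit this job because most of the time is spent waiting on the network, but more parallel requests also raise the chance of being rate limited, so keep max_workers small.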

>- References 
- https://github.community/t/how-to-setup-github-actions-to-run-my-python-script-on-schedule/18335/2
- https://medium.datadriveninvestor.com/accessing-github-secrets-in-python-d3e758d8089b
- https://pandas.pydata.org/docs/reference/frame.html#


>- Here are some things to keep in mind w.r.t. web scraping:

- Most websites disallow web scraping for commercial purposes
- Prefer using web scraping only for learning and research purposes
- Some websites may block your IP or stop sending valid information if you send too many requests
- Review the terms and conditions of a website before scraping data from it
- Remove sensitive and personally identifiable information before publishing a dataset online
- Use official REST APIs wherever available, with proper API keys
- Scraping data that you see after logging in is harder (it requires special cookies and headers)
- Websites change their HTML layout frequently, which may cause your scraping scripts to break
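
One simple way to avoid sending too many requests is to pause between them and retry a failed page only a few times. A minimal sketch, assuming the caller passes in the function that does the actual download; `fetch_all` and its parameters are hypothetical names, and the demo uses a stand-in instead of a real request:

```python
import time


def fetch_all(urls, fetch, delay=1.0, retries=2):
    """Call fetch(url) for each URL, pausing between calls and retrying a few times on failure."""
    results = {}
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay)  # pause between consecutive URLs
        for attempt in range(retries + 1):
            try:
                results[url] = fetch(url)
                break
            except Exception:
                if attempt == retries:
                    results[url] = None  # give up on this URL
                else:
                    time.sleep(delay * (attempt + 1))  # wait a little longer each time
    return results


# Demo with a stand-in fetch function (no network access)
pages = fetch_all(['page-1', 'page-2'], fetch=str.upper, delay=0)
print(pages)
```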