In [None]:
# default_exp scraper

# The Guardian Scraper

> Scraping Premier League Previews from the Guardian.

<div style="font-size: 200px">
    
|            Issues                 |          Solutions          |
|------------------------------     |-------------------|
|   Previews come in 4 possible formats (old format, new format, Cup format and one particular format) |Select the appropriate HTML tags|
|   Preview titles are not consistent (we can find "Squad Sheets" or "match preview")|Pick only the names of the teams and discard the rest|
|   The date of the match is not always available |Use the preview date|
|   The order of the elements and labels is not always the same |Use regex patterns to extract the information|
|   Missing values for betting odds |Treat the general case separately and set up specific regex patterns for these particular cases|
|   Odds format is different|Treat the general case separately and set up specific regex patterns for these particular cases|
|   Odds can have non-numeric values (Evens, evens, Eve)|Replace evens with 1-1|
|   Some previews have no author or text|For previews that have no text, we store None (not available)|
|   Previews also exist for the FA Cup, Carabao Cup, Champions League and World Cup|Filter previews by title, link, topic, aside HTML section and preview text, and keep only Premier League previews|
|   We are not sure that the names of the teams are the same as the ones in Opta|Set up a dictionary or check manually to map teams to their IDs|
|When we send many requests, the Guardian server interprets them as a DDoS attack and blocks our IP address|Sleep for a random number of seconds between requests, or change IP by working with a rotating proxy|
</div >
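
As a rough illustration of the "evens" row in the table, the sketch below rewrites the textual value as the fractional odd 1-1 before any numeric pattern is applied. The helper name `normalize_evens` and the sample line are assumptions made for this example; the project's own extractor may handle this case differently.

```python
import re

# Hypothetical helper (not part of the project): rewrite "Evens", "evens"
# or "Eve" as the fractional odd "1-1" so that numeric patterns can match it.
EVENS_PATTERN = re.compile(r"\b(?:evens|eve)\b", flags=re.IGNORECASE)


def normalize_evens(line: str) -> str:
    """Replace every standalone evens label in the line with "1-1"."""
    return EVENS_PATTERN.sub("1-1", line)


# Made-up sample line; "Everton" is left untouched thanks to the word boundaries.
print(normalize_evens("Odds Everton Evens Chelsea 5-2 Draw 11-4"))
# prints: Odds Everton 1-1 Chelsea 5-2 Draw 11-4
```

The word boundaries matter here because several team names start with the same letters as the label.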


### Import Libraries and Modules

In [None]:
# hide
from nbdev.showdoc import *

In [None]:
# export
import pandas as pd
import dateparser
import random
import requests
import logging
from guardian_scraper.Db_Config import *
from guardian_scraper.parser import *
from guardian_scraper.extractor import *
from guardian_scraper.mapper import *
from guardian_scraper.models.preview import *
from typing import Dict, Union
from requests_html import HTMLSession
from bs4 import BeautifulSoup
from time import sleep
from datetime import datetime

### ScrapingTheGuardian Class

##### This class represents a scraper for the "Guardian" website and has 3 methods:

1- <b> calculate_betting_odds </b> returns decimal odds.

&emsp;In this section, we calculate the decimal odds from the fractional odds found in the football preview.
<br>&emsp;Consider the following example:
<br>&emsp;&emsp; ["9-20","29-5","6-5"] 
<br>&emsp;&emsp;We convert each fractional odd (home, away, draw) separately using the following formula:
<br>&emsp;&emsp;&emsp; home = (9/20) + 1 
<br>&emsp;&emsp;&emsp; away = (29/5) + 1
<br>&emsp;&emsp;&emsp; draw = (6/5) + 1
<br>&emsp;If the conversion succeeds, the decimal odds are returned in a Python dictionary.<br>&emsp;Otherwise, the values will be None (not available).
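
The conversion above can be sketched on its own as follows. This is a minimal standalone example, not the class method itself: `fractional_to_decimal` is a hypothetical helper name, and returning None for unusable values is an assumption that mirrors the "not available" convention described here.

```python
from typing import Optional


def fractional_to_decimal(fraction: Optional[str]) -> Optional[float]:
    """Convert a fractional odd written as "a-b" to decimal odds: (a / b) + 1."""
    try:
        numerator, denominator = fraction.split("-")
        return int(numerator) / int(denominator) + 1
    except (AttributeError, ValueError, ZeroDivisionError):
        # None, non-numeric text or a zero denominator: not available
        return None


odds = ["9-20", "29-5", "6-5"]
home, away, draw = (fractional_to_decimal(value) for value in odds)
print(round(home, 2), round(away, 2), round(draw, 2))
# prints: 1.45 6.8 2.2
```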


2- <b> extract_preview_items </b> returns the entire contents of a football preview.

&emsp;In this section, we call the functions defined in the PageExtractor class and return a Python dictionary containing all of the extracted information.
<br>&emsp;Before building the dictionary, we use the <b>calculate_betting_odds</b> function to calculate the decimal odds for the home team's victory, the away team's victory, and a draw.

The returned keys are "game_id","home_team","away_team","text","author","venue","referee","odds","odds_home_team","odds_away_team","odds_draw","preview_date","game_date","preview_link".

3- <b> extract_previews </b> returns the information of all extracted previews.

&emsp;We retrieve all previews on a given page and go through them one by one, taking each link and fetching its data through the Guardian API.
<br>&emsp;If the Guardian API does not respond successfully, we fall back to the traditional process of HTML parsing.
<br>&emsp;For each preview, we look up the ids of the home team and the away team and match them against the <b>opta.fixture</b><br>&emsp;database to get the gameId and gameDate.
<br>&emsp;If the game exists, we proceed with the data extraction: the previous function, <b>extract_preview_items</b>, is called to extract the information from
<br>&emsp;the preview, which is then saved in a MongoDB collection and appended to the list "all_previews".
<br>&emsp;However, we only extract previews that are newer than the last preview already stored in the database.
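
The step that turns a preview link into a Guardian API request can be sketched as below. `build_api_url` is a hypothetical helper, the sample link is made up, and the API key is passed in as an argument (for instance read from configuration) rather than taken from this notebook; only the host name and the `api-key` and `show-blocks` parameters come from the code in this section.

```python
from urllib.parse import urlencode, urlparse

API_ROOT = "https://content.guardianapis.com"


def build_api_url(preview_link: str, api_key: str) -> str:
    """Map a www.theguardian.com article link to its content API URL."""
    # Keep only the path of the article and drop surrounding slashes
    path = urlparse(preview_link).path.strip("/")
    query = urlencode({"api-key": api_key, "show-blocks": "all"})
    return f"{API_ROOT}/{path}?{query}"


# Made-up link and placeholder key, for illustration only
print(build_api_url("https://www.theguardian.com/football/example-preview", "test"))
# prints: https://content.guardianapis.com/football/example-preview?api-key=test&show-blocks=all
```

Building the query with `urlencode` keeps the key out of hand-written string concatenation and avoids a doubled slash between the host and the path.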

In [None]:
# export
class ScrapingTheGuardian:
    """
    A class to represent a scraper from the "Guardian" website.

    ...

    Attributes
    ----------
    session : requests_html.HTMLSession
        a web session
    VENUE_REGEX : str
        venue regex expression
    REFEREE_REGEX : str
        referee regex expression
    ODDS_REGEX : str
        odds regex expression

    Methods
    -------
    calculate_betting_odds(odds)
        returns decimal odds.
    extract_preview_items(content,link,preview_date,game_date,game_id,home_team,away_team,response_type)
        returns all information of a football preview.
    extract_previews(page,previews_last_date,last_preview,all_previews,df_teams)
        returns the stop indicator and the information of all extracted previews.

    """

    # venue, referee, odds pattern regex
    # in some previews, all of the information is on the same line.
    VENUE_REGEX = r"Venue(.*)Tickets|Venue(.*),|Venue(.*)"
    REFEREE_REGEX = r"Referee(.*)This season's|Referee(.*)Last season's|Referee(.*)Odds|Referee[\s](.*)|Ref(.*)Odds"
    # {Odds H 11-8 A 11-8 D 11-8}
    # {Odds Liverpool 11-8 Aston Villa 11-8 Draw 11-8}
    # missing label {Odds H 11-8 11-8 D 11-8}
    # missing value {Odds H 11-8 A 11-8}
    ODDS_REGEX = r"Odds[\s]*[a-zA-Z' ]*(\d{1,3}-[\s]*\d{1,3})[\s]*[a-zA-Z' ]*(\d{1,3}-[\s]*\d{1,3})([\s]*[a-zA-Z']*[\s]*(\d{1,3}-[\s]*\d{1,3}))*"

    def __init__(self):

        # Initialize session to start scraping
        self.session = HTMLSession()

    @staticmethod
    def calculate_betting_odds(odds: list) -> Dict[str, object]:
        """
          returns decimal odds.

        Parameters
        ----------
        odds: list of str
            odds values

        Returns
        -------
        betting_odds: dict of object

        """
        # Initialize betting odds to None (not available)
        # Some previews may not include odds
        odds_home = None
        odds_away = None
        odds_draw = None

        def to_decimal(fraction, label):
            # Convert a fractional odd such as "4-6" to decimal: (4/6) + 1
            try:
                numerator, denominator = fraction.split("-")
                return (int(numerator) / int(denominator)) + 1
            except (AttributeError, ValueError, ZeroDivisionError):
                # missing value (None), non-numeric value or zero denominator
                logging.error("{} odds are wrong".format(label))
                return None

        if odds is not None:  # If odds exist
            # example of odds:
            # {H 4-6 A 43-10 D 3-1}
            # {liverpool 4-6 Tottenham 43-10 Draw 3-1}
            # {H 4-6 43-10 D 3-1}
            # {H 4-6 A 43-10}
            # The formula will be (4/6)+1 , (43/10)+1 , (3/1)+1
            odds_home = to_decimal(odds[0], "Home team")
            odds_away = to_decimal(odds[1], "Away team")
            # if we have the normal format of odds
            # we will have 3 parts(odds_home,odds_away,odds_draw)
            if len(odds) >= 4:
                # a 4-item list carries an extra item at index 2
                # (presumably the enclosing regex group); drop it in place
                odds.pop(2)
            if len(odds) >= 3:
                odds_draw = to_decimal(odds[2], "Draw")

        betting_odds = {
            "odds_home": odds_home,
            "odds_away": odds_away,
            "odds_draw": odds_draw,
        }
        return betting_odds

    @staticmethod
    def extract_preview_items(
        content: BeautifulSoup,
        link: str,
        preview_date: datetime,
        game_date: datetime,
        game_id: int,
        home_team: str,
        away_team: str,
        response_type: str,
    ) -> Dict[str, object]:
        """
          returns all information of a football preview

        Parameters
        ----------
        content: bs4.BeautifulSoup
            the html format of the preview content
        link: str
            the link of the preview
        preview_date: datetime
            the preview date
        game_date: datetime
            the game date
        game_id: int
            the game id
        home_team: str
            the home team name
        away_team: str
            the away team name
        response_type: str
            the parsing method('api' or 'html')

        Returns
        -------
        preview_items: dict of object

        """

        # meth1: extract match infos (venue,referee,odds)
        match_infos = PageExtractor.extract_match_infos(
            content,
            response_type,
            ScrapingTheGuardian.VENUE_REGEX,
            ScrapingTheGuardian.REFEREE_REGEX,
            ScrapingTheGuardian.ODDS_REGEX,
        )
        venue = match_infos["venue"]
        referee = match_infos["referee"]
        odds = match_infos["odds"]
        # meth2: extract text and author of the preview
        text_author = PageExtractor.extract_text_authors(content, response_type)
        text = text_author["text"]
        author = text_author["author"]
        # meth3: calculate betting odds
        betting_odds = ScrapingTheGuardian.calculate_betting_odds(odds)
        # Home team betting odds
        odds_home_team = betting_odds["odds_home"]
        # Away team betting odds
        odds_away_team = betting_odds["odds_away"]
        # Draw betting odds
        odds_draw = betting_odds["odds_draw"]
        # Return preview items
        preview_items = dict(
            {
                "game_id": game_id,
                "home_team": home_team,
                "away_team": away_team,
                "text": text,
                "author": author,
                "venue": venue,
                "referee": referee,
                "odds": odds,
                "odds_home_team": odds_home_team,
                "odds_away_team": odds_away_team,
                "odds_draw": odds_draw,
                "preview_date": preview_date,
                "game_date": game_date,
                "preview_link": link,
            }
        )
        return preview_items

    def extract_previews(
        self,
        page: BeautifulSoup,
        previews_last_date: datetime,
        last_preview: bool,
        all_previews: list,
        df_teams: pd.DataFrame,
    ) -> tuple:
        """
          extracts all new previews of a page and saves them in the database

        Parameters
        ----------
        page: bs4.BeautifulSoup
            the html format of the page
        previews_last_date : datetime
            the last extracted preview date in the database
        last_preview: bool
            an indicator to know when we should stop the scraper
        all_previews: list
            a list that contains all extracted previews
        df_teams: pd.DataFrame
            a dataframe that contains teams and their different names

        Returns
        -------
        tuple of (bool, list)
            the updated last_preview indicator and the list of extracted previews

        """
        # We pick all of the match previews on the webpage.
        previews = page.findAll("div", {"class": "fc-item__content"})
        # for each preview we extract its information.
        for preview in previews:
            # we pick the preview date and we parse it in a date format
            preview_date = preview.find("time")["datetime"]
            preview_date = dateparser.parse(preview_date, settings={"TIMEZONE": "UTC"})
            # if the date selected from the previews database exists
            # and has been reached by the preview date, we stop the loop
            # and mark last_preview as True.
            if previews_last_date and preview_date.date() <= previews_last_date.date():
                logging.info("The scraper turned off")
                last_preview = True
                break
            # Pick the preview link
            preview_link = preview.find("a")["href"]

            # We extract the path of the link, which corresponds to the preview id in the api
            api_preview_url = preview_link.replace("https://www.theguardian.com", "")
            # request the api (the extracted path already starts with "/")
            response = requests.get(
                "https://content.guardianapis.com"
                + api_preview_url
                + "?api-key=fd4452e9-76a5-45a1-b30d-bdd156640b9c&show-blocks=all"
            )
            # if the api works we get the title and the content of the preview
            # else we extract html contents
            if response:
                logging.info("The Guardian Api works")
                # get the preview data
                data = response.json()
                # preview title
                preview_title = data["response"]["content"]["webTitle"]
                # preview content
                preview_content = BeautifulSoup(
                    data["response"]["content"]["blocks"]["body"][0]["bodyHtml"],
                    "html.parser",
                )
                # preview date
                preview_date = data["response"]["content"]["blocks"]["body"][0][
                    "createdDate"
                ]
                preview_date = dateparser.parse(
                    preview_date, settings={"TIMEZONE": "UTC"}
                )
                response_type = "api"

            else:
                logging.info("The Guardian Api does not work")
                preview_content = Parser.parse_page(preview_link, self.session)
                preview_title = preview_content.find("h1").text
                response_type = "html"

            # extract team names
            names = PageExtractor.extract_teams_names(preview_title)
            # Home team and  Away Team
            home_team = names["home"]
            away_team = names["away"]
            # get teams id
            home_team_id = PreviewsMapping.get_team_id(home_team, df_teams)
            away_team_id = PreviewsMapping.get_team_id(away_team, df_teams)
            # pick the preview date
            # get the id and the date of the game
            game = PreviewsMapping.get_game_id_date(
                home_team_id, away_team_id, preview_date
            )
            # if the game exists we extract the preview information
            if game is not None:
                preview_infos = ScrapingTheGuardian.extract_preview_items(
                    preview_content,
                    preview_link,
                    preview_date,
                    game.gameDate,
                    game.gameId,
                    home_team,
                    away_team,
                    response_type,
                )
                logging.info("Returned Preview information: {}".format(preview_infos))
                # connect to database
                mongoengine_client = MongoClient.connect("1")
                # preview class
                preview = Previews(
                    gameId=preview_infos["game_id"],
                    homeTeam=preview_infos["home_team"],
                    awayTeam=preview_infos["away_team"],
                    text=preview_infos["text"],
                    author=preview_infos["author"],
                    venue=preview_infos["venue"],
                    referee=preview_infos["referee"],
                    odds=preview_infos["odds"],
                    oddsHomeTeam=preview_infos["odds_home_team"],
                    oddsAwayTeam=preview_infos["odds_away_team"],
                    oddsDraw=preview_infos["odds_draw"],
                    gameDate=preview_infos["game_date"],
                    previewDate=preview_infos["preview_date"],
                    previewLink=preview_infos["preview_link"],
                )
                # Validate and save input raw data
                MongoClient.save(preview)
                all_previews.append(preview_infos)
            
            else:
                logging.info("The game does not exist in the Opta database")
                
                
        return last_preview, all_previews
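
To see what `ODDS_REGEX` captures, the sketch below applies it to lines modelled on the formats listed in the class comments. It assumes that the odds are read from the match groups, which this notebook does not confirm since the extraction itself lives in `PageExtractor`; `capture_odds` is a hypothetical helper written for this example.

```python
import re

# Same pattern as ScrapingTheGuardian.ODDS_REGEX
ODDS_REGEX = r"Odds[\s]*[a-zA-Z' ]*(\d{1,3}-[\s]*\d{1,3})[\s]*[a-zA-Z' ]*(\d{1,3}-[\s]*\d{1,3})([\s]*[a-zA-Z']*[\s]*(\d{1,3}-[\s]*\d{1,3}))*"


def capture_odds(line):
    """Return the four regex groups as a list, or None when nothing matches."""
    match = re.search(ODDS_REGEX, line)
    return list(match.groups()) if match else None


print(capture_odds("Odds H 4-6 A 43-10 D 3-1"))
# prints: ['4-6', '43-10', ' D 3-1', '3-1']
print(capture_odds("Odds H 4-6 43-10 D 3-1"))  # missing label
# prints: ['4-6', '43-10', ' D 3-1', '3-1']
print(capture_odds("Odds H 4-6 A 43-10"))  # missing value
# prints: ['4-6', '43-10', None, None]
```

Under that assumption, the third group is the enclosing "label + value" group, which would explain why `calculate_betting_odds` drops index 2 before reading the draw odds.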

### Scraping all pages

In [None]:
# store scraper actions in a log file
logging.basicConfig(
    filename="scraper.log", level=logging.INFO, format="%(levelname)s:%(message)s"
)
# starting url
url = "https://www.theguardian.com/football/series/match-previews?page=36"
# initialize the scraper instance.
scraper = ScrapingTheGuardian()
# initially we are not at the last page.
last_page = False
# we'll extract the previews that haven't already been extracted.
last_preview = False
# we fetch the most recent preview stored in the previews collection;
# the scraper will be turned off once it reaches that preview date.
mongoengine_client = MongoClient.connect("1")
previews_last_date = Previews.objects().order_by("-previewDate", "-gameDate").first()

# if the database is empty we will scrape all pages
if previews_last_date is not None:
    last_previews_date = previews_last_date.previewDate
else:
    last_previews_date = None
    
logging.info("The last preview date stored in the database is: {}".format(last_previews_date))
all_previews = []
# load the teams dictionary
df_teams = pd.read_csv(".//datasets//final_data.csv")
# if we are not at the last page
# and we haven't reached an extracted preview
# we launch the scraper
while not last_page and not last_preview:
    # a random waiting time, in seconds
    wait_seconds = random.randint(2, 60)
    logging.info("The scraper will wait {} seconds ...".format(wait_seconds))
    # wait before sending the next request
    sleep(wait_seconds)
    # get the html format of the page containing previews
    page = Parser.parse_page(url, scraper.session)
    # launch the scraper , extract previews information
    last_preview, all_previews = scraper.extract_previews(
        page, last_previews_date, last_preview, all_previews, df_teams
    )
    # get the url of the following page and verify if we are at the last page
    url, last_page = Parser.get_next_page(page)
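
The waiting step of the loop above can be isolated into a small helper, sketched here under the assumption that a random delay between requests is enough to avoid being blocked. `polite_wait` and its parameters are hypothetical names, and the sleep function is injectable so that the example runs without actually waiting.

```python
import random
from time import sleep
from typing import Callable


def polite_wait(
    min_seconds: int = 2,
    max_seconds: int = 60,
    sleep_fn: Callable[[float], None] = sleep,
) -> int:
    """Sleep for a random whole number of seconds and return the delay used."""
    delay = random.randint(min_seconds, max_seconds)
    sleep_fn(delay)
    return delay


# Usage with a stub instead of time.sleep, so nothing blocks here:
# the stub simply records the delay it was asked to wait.
recorded = []
delay = polite_wait(2, 60, sleep_fn=recorded.append)
print(delay, recorded)
```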

In [None]:
data = pd.DataFrame(all_previews)

In [None]:
data

Unnamed: 0,game_id,home_team,away_team,text,author,venue,referee,odds,odds_home_team,odds_away_team,odds_draw,preview_date,game_date,preview_link
0,987970,Tottenham,Everton,Tottenham are euphoric after Wednesday’s Champ...,David Hytner,Tottenham H Stadium,Andre Marriner,"[5-4, 5-2, 5-2]",2.250000,3.500000,3.500000,2019-05-10 16:02:57+00:00,2019-05-12 14:00:00,https://www.theguardian.com/football/2019/may/...
1,987965,Fulham,Newcastle,Scott Parker’s future is no longer in question...,Benjy Nurick,Craven Cottage,Kevin Friend,"[6-4, 7-4, 5-2]",2.500000,2.750000,3.500000,2019-05-10 15:41:23+00:00,2019-05-12 14:00:00,https://www.theguardian.com/football/2019/may/...
2,987962,Brighton,Manchester City,Vincent Kompany’s unforgettable winner against...,Benjy Nurick,Amex Stadium,Michael Oliver,"[18-1, 2-11, 8-1]",19.000000,1.181818,9.000000,2019-05-10 15:26:50+00:00,2019-05-12 14:00:00,https://www.theguardian.com/football/2019/may/...
3,987963,Burnley,Arsenal,Arsenal will be glad to see the back of their ...,Graham Searles,Turf Moor,Mike Dean,"[23-10, 11-9, 3-1]",3.300000,2.222222,4.000000,2019-05-10 15:15:24+00:00,2019-05-12 14:00:00,https://www.theguardian.com/football/2019/may/...
4,987964,Crystal Palace,Bournemouth,"After a season of twists and turns, of giddy h...",Dominic Fifield,Selhurst Park,Roger East,"[10-11, 11-4, 3-1]",1.909091,3.750000,4.000000,2019-05-10 13:52:14+00:00,2019-05-12 14:00:00,https://www.theguardian.com/football/2019/may/...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
270,987636,Manchester City,Fulham,Fulham are the first of the promoted sides to ...,Jamie Jackson,Etihad Stadium,Stuart Atwell,"[1-7, 25-1, 10-1]",1.142857,26.000000,11.000000,2018-09-14 14:00:55+00:00,2018-09-15 14:00:00,https://www.theguardian.com/football/2018/sep/...
271,987635,Huddersfield Town,Crystal Palace,Both sides go into the game on the back of dis...,Paul Doyle,John Smith’s Stadium,Graham Scott,"[23-10, 11-5, 27-17]",3.300000,3.200000,2.588235,2018-09-14 12:35:57+00:00,2018-09-15 14:00:00,https://www.theguardian.com/football/2018/sep/...
272,987624,Cardiff City,Arsenal,Unai Emery has an excellent chance to record a...,Graham Searles,Cardiff City Stadium,Anthony Taylor,"[9-2, 4-7, 3-1]",5.500000,1.571429,4.000000,2018-09-01 08:00:48+00:00,2018-09-02 12:30:00,https://www.theguardian.com/football/2018/sep/...
273,987623,Burnley,Manchester United,Burnley have not made the blistering start the...,Paul Wilson,Turf Moor,Jonathan Moss,"[9-2, 4-6, 13-5]",5.500000,1.666667,3.600000,2018-09-01 08:00:48+00:00,2018-09-02 15:00:00,https://www.theguardian.com/football/2018/sep/...


In [None]:
data.to_csv("~//previews//dataset//previews.csv",index=False)