# Data Analytics for Dublin Rental Scenario

###  Part 1: Scraping the data from the website
On inspecting the website's HTML, the data turned out to be nested under several sublinks (2024-Q1, 2024-Q2, etc.), each of which has 26 more pages containing the rental data. There are two main ways of scraping data from paginated webpages: appending a query to the index URL, or following the link tags in the HTML content of each page.

Here, we scrape the data using the tags in the website's HTML.

##### Approach: 
> Scrape the hyperlinks from the parent website

> Scrape pagination links from those hyperlinks to create one single list

> Scrape all the information from the list of all pages using their HTML tags

> Save the scraped data in a CSV file
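
The first two steps both depend on turning the relative `href` values found in a page into absolute URLs. As a minimal sketch, assuming the pages use relative links (the file names below are hypothetical and not taken from the site), the standard library's `urllib.parse.urljoin` resolves each link against the URL of the page it was found on, so no base path has to be hard-coded:

```python
from urllib.parse import urljoin

def resolve_links(page_url, hrefs):
    """Resolve relative href values against the URL of the page they came from."""
    return [urljoin(page_url, href) for href in hrefs]

# Hypothetical href values, used only for illustration
index_url = "http://mlg.ucd.ie/modules/python/assignment1/rental/index.html"
sample_hrefs = ["index.html", "2024-Q1.html", "2024-Q1-page02.html"]

for link in resolve_links(index_url, sample_hrefs):
    print(link)
```

`resolve_links` is an illustrative helper name; the scraper in this notebook concatenates a fixed base URL instead.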

In [1]:
# Defining a function which takes a single string argument (the index page URL)
def rentalScraper(url: str):
    
    # Importing requisite libraries
    import bs4                       # library for parsing HTML (BeautifulSoup)
    import urllib.request            # library for making HTTP requests; the request submodule must be imported explicitly
    import csv                       # library for reading and writing CSV files

    
    # Fetching webpage's HTML content
    response = urllib.request.urlopen(url)                    # fetches the HTML content of the page
    html_data = response.read().decode()                      # reads and decodes the HTML content
    soup = bs4.BeautifulSoup(html_data, "html.parser")        # creating a BeautifulSoup object from the HTML content

    
    # Extracting the hyperlink elements (Q1-2024, Q2-2024 etc.)
    hyperlinks_html = soup.find_all("a", href=True)           

    
    # Lists to store extracted links
    hyperlinks = []                                            # list of quarterly links (Q1-2024, Q2-2024 etc.)
    pagelinks = []                                             # list of all the pages within the hyperlinks collected

    
    # Base URL for constructing complete page URLs
    indexlink = "http://mlg.ucd.ie/modules/python/assignment1/rental/" 

    
    # Extracting and constructing complete URLs for quarterly rental data.
    for match in hyperlinks_html:
        quarter_links = match.get('href')
        hyperlinks.append(indexlink+quarter_links)            # Concatenating indexlink and the extracted quarter_links to create a complete URL
        # print(quarter_links)                                # (Commented out) Used to check extracted links

        
    # Extracting and constructing complete URLs from all the pages of quarterly rental data
    for links in (hyperlinks[1:]):                                # Skipping the first as it leads to homepage of the website
        response = urllib.request.urlopen(links)                  # fetches the HTML content of the page
        html_data = response.read().decode()                      # reads and decodes the HTML content
        soup = bs4.BeautifulSoup(html_data, "html.parser")        # creating a BeautifulSoup object from the HTML content

        
        # Finding all the paginated links on the page
        pagelinks_html = soup.find_all("a", href=True)            # finding all the <a> tags containing "href" attributes
        pagelinks.append(links)                                   # storing the first page of quarterly data    
        for match in pagelinks_html[1:-1]:                        # storing the remaining pages; the slice drops the first link and the trailing "Next" pagelink
            pages = match['href']                                 
            pagelinks.append(indexlink + pages)                   # concatenating indexlink and the extracted page link to create a complete URL
            
    # for i in pagelinks:                                        # (Commented out) Used to check extracted links
    #     print(i)

    
    # Lists to store extracted rental details
    price       = []
    month       = []
    prop_type   = []
    location    = []
    bedrooms    = []
    bathrooms   = []
    parking     = []
    garden      = []
    lease       = []
    contact     = [] 
    rental_info = []

    
    # Scraping rental details from each pagination link
    for links in pagelinks:
        response = urllib.request.urlopen(links)                               # fetches the HTML content of the page
        pagelink_html_data = response.read().decode()                          # reads and decodes the HTML content
        soup = bs4.BeautifulSoup(pagelink_html_data, "html.parser")            # creating a BeautifulSoup object from the HTML content

        
        # Finding all <li> elements (list items) on the page
        tag_li = soup.find_all("li")
        for li in tag_li:
            month.append(soup.find("span", {"class":"record"}).text.strip())    # Extract and store the rental record month

            
            # Each rental property record is represented within a table, and the details are stored within <tr> tags.
            # Each <tr> tag contains multiple <td> tags, with the property details listed sequentially across rows
            # There are 9 fields in total: the first 8 are read in the loop below (range(0, 8) yields 0 to 7),
            # and the 9th (contact) is read separately after the loop
            tag_tr = li.find_all("tr")
            for i in range (0,8):                                               
                if (i==0):
                    price.append((tag_tr[i].find_all("td")[1]).text.strip())
                elif (i==1):
                    prop_type.append((tag_tr[i].find_all("td")[1]).text.strip())
                elif (i==2):
                    location.append((tag_tr[i].find_all("td")[1]).text.strip())
                elif (i==3):
                    bedrooms.append((tag_tr[i].find_all("td")[1]).text.strip())
                elif (i==4):
                    bathrooms.append((tag_tr[i].find_all("td")[1]).text.strip())
                elif (i==5):
                    parking.append((tag_tr[i].find_all("td")[1]).text.strip())
                elif (i==6):
                    garden.append((tag_tr[i].find_all("td")[1]).text.strip())
                elif (i==7):
                    lease.append((tag_tr[i].find_all("td")[1]).text.strip())
                    
            contact.append((tag_tr[8].find_all("td")[1]).text.strip())

            
    # Storing the extracted data into a dictionary which will be used to create the CSV using DictWriter class of CSV module
    for i in range(len(month)):
        temp_dict = {"Month":month[i], 
                     "Price":price[i], 
                     "Property Type":prop_type[i], 
                     "Location":location[i], 
                     "Bedrooms":bedrooms[i], 
                     "Bathrooms":bathrooms[i], 
                     "Parking":parking[i], 
                     "Garden":garden[i], 
                     "Lease Length":lease[i], 
                     "Contact":contact[i]}
        
        rental_info.append(temp_dict)
        

    # Writing the data into a CSV file for further analysis
    with open ("DublinRental.csv", "w", encoding="utf-8", newline="") as file:    # newline="" lets the csv module control line endings
        fields = ["Month", "Price",                         # defining fields
                  "Property Type", 
                  "Location", 
                  "Bedrooms", 
                  "Bathrooms", 
                  "Parking", 
                  "Garden", 
                  "Lease Length", 
                  "Contact"]
        writer = csv.DictWriter(file, fieldnames=fields)    # creating a CSV writer object
        writer.writeheader()                                # writing the header row 
        for record in rental_info:                          # writing each rental record to CSV as a row
            writer.writerow(record)
    print(f"Process complete. \n{len(price)} rental records have been added to the CSV file 'DublinRental.csv'.")
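
The `if`/`elif` chain and the parallel lists above could be collapsed by pairing the field names with the cell values in order. The following is a sketch of that alternative, not the function used to produce the results in this notebook. `build_record` is a hypothetical helper, and the sample values are illustrative rather than fetched from the site:

```python
import csv
import io

FIELDS = ["Month", "Price", "Property Type", "Location", "Bedrooms",
          "Bathrooms", "Parking", "Garden", "Lease Length", "Contact"]

def build_record(month, cell_values):
    """Pair the month and the 9 table values with their field names, in order."""
    if len(cell_values) != len(FIELDS) - 1:
        raise ValueError(f"expected {len(FIELDS) - 1} values, got {len(cell_values)}")
    return dict(zip(FIELDS, [month, *cell_values]))

# Illustrative values in the same order as the table rows
cells = ["€ 7,200", "Apartment", "Dublin City South - Dublin 2", "3",
         "1 Bathroom", "Yes", "No", "3 months", "Estate Agent"]
record = build_record("January 2024", cells)

# Writing to an in-memory buffer here; a real run would open a file with newline=""
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(record)
print(buffer.getvalue())
```

The length check makes a malformed table fail loudly instead of silently shifting values into the wrong columns.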

### Running the function

In [2]:
rentalScraper("http://mlg.ucd.ie/modules/python/assignment1/rental/index.html")

Process complete. 
1950 rental records have been added to the CSV file 'DublinRental.csv'.
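
The reported count can be cross-checked against the file itself. As a minimal sketch, assuming the CSV was written with a header row, the column names and the number of data rows can be read back in one pass; `summarise_csv` is a hypothetical helper name:

```python
import csv

def summarise_csv(path):
    """Return the header fields and the number of data rows in a CSV file."""
    with open(path, "r", encoding="utf-8", newline="") as file:
        reader = csv.DictReader(file)
        row_count = sum(1 for _ in reader)   # iterating also populates reader.fieldnames
        return reader.fieldnames, row_count

# Usage, once the scraper has run:
# fields, rows = summarise_csv("DublinRental.csv")
# print(fields, rows)
```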


### Verifying the output file

In [145]:
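import csv                                                   # csv is imported only inside rentalScraper above, so import it here too

with open ("DublinRental.csv", "r", encoding="utf-8", newline="") as file:
    reader = csv.DictReader(file)
    for record in reader:
        print (f"{record}\n")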

{'Month': 'January 2024', 'Price': '€ 7,200', 'Property Type': 'Apartment', 'Location': 'Dublin City South - Dublin 2', 'Bedrooms': '3', 'Bathrooms': '1 Bathroom', 'Parking': 'Yes', 'Garden': 'No', 'Lease Length': '3 months', 'Contact': 'Estate Agent'}

{'Month': 'January 2024', 'Price': '€ 2,960 per month', 'Property Type': 'Apartment', 'Location': 'Dublin City South - Dublin 24', 'Bedrooms': '2 Bedrooms', 'Bathrooms': '2 Bathrooms', 'Parking': 'Yes', 'Garden': '???', 'Lease Length': '3 months', 'Contact': 'Estate Agent'}

{'Month': 'January 2024', 'Price': '€1,920.00 per month', 'Property Type': 'Apartment', 'Location': 'Dublin City South - Dublin 24', 'Bedrooms': '2 Bedrooms', 'Bathrooms': '2 Bathrooms', 'Parking': 'No', 'Garden': 'No', 'Lease Length': '12 months', 'Contact': 'Estate Agent'}

{'Month': 'January 2024', 'Price': '€ 2,590', 'Property Type': 'Apartment', 'Location': 'Dublin City South - Dublin 6', 'Bedrooms': '2 Bedrooms', 'Bathrooms': '1 Bathroom', 'Parking': 'No', 'Ga