# Instructions

As a data scientist, you've found a data treasure on TampereBNB - the top online platform for short-term housing rentals in the city of Tampere. From thousands of listings, you can extract valuable insights to inform decision-making, marketing, and research. Your goal is to scrape and clean this data to enable further analysis.

## Accessing the HTML dataset
The quickest way to get HTML data is to save a page to your computer manually. However, a web page's HTML tends to change over time, and scraping code can break whenever a site is redesigned, which makes scraping brittle and a poor fit for long-lived projects. To ease access and manage traffic, we have already scraped the TampereBNB website and stored the pages in the `data/accom` folder. Please use the HTML files provided and treat them as if you had saved them yourself. I recommend opening the HTML files in your preferred text editor to inspect how the website structures its information.

## TODO

- Students are expected to extract the following information:
    - Region
    - Price of the accommodation
    - Apartment type
    - Size in square meters (m²)
    - Apartment floor
    - Construction year
    - Apartment status
    - The availability of an elevator
    - Longitude
    - Latitude
- Change the data format, if necessary, of:
    - Floor
    - Size
    - Construction
    - Longitude
    - Latitude
- Save the data frame in a pickle file

Screenshot of what the dataframe is expected to look like:

![df_screenshot](./data/images/df_screenshot.png)

## Notes

<div class="alert alert-block alert-danger">
<b>Do not:</b> change the Jupyter Notebook file's name.
</div>

<div class="alert alert-block alert-danger">
<b>Use:</b> the given list as column names.
</div>

<div class="alert alert-block alert-danger">
<b>Save:</b> the dataframe as a pickle file with the name "TampereBNB.pkl".
</div>

<div class="alert alert-block alert-info">
<b>Tip:</b> Please keep in mind that although this template is designed in a linear style, generally, data wrangling is an iterative process.
</div>


# Reminder

This section is devoted to refreshing your memory on prerequisites to complete the assignment. 

## How does web scraping work? 
Website data is written in HTML (HyperText Markup Language), which uses tags to structure the page. Because HTML and its tags are just text, that text can be read with a parser. We'll be using a Python parser called [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/).
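
As a minimal sketch of what parsing means in practice, the snippet below feeds a small hand-written HTML string (hypothetical, not taken from TampereBNB) to Beautiful Soup and reads two pieces of text back out. It assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page (hypothetical, not a real TampereBNB listing)
html = "<html><head><title>Example</title></head><body><p>Hello, Tampere!</p></body></html>"

# Parse the text into a tree of tag objects using Python's built-in HTML parser
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.text    # text inside <title>
paragraph_text = soup.p.text    # text inside the first <p>
print(title_text, "|", paragraph_text)
```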


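The following script was used to download the HTML files programmatically:
```python
import random
import time

import requests

for listing_index in range(0, 1080):
    # listing_index is an int, so format it instead of concatenating with +
    print("getting the %d url:" % listing_index)
    url = 'https://joda-tuni.azurewebsites.net/accom/M20%1d23' % listing_index
    print(url)
    page = requests.get(url)
    fname = 'data/accom/M20%1d23.html' % listing_index
    print(fname)
    # Save the raw bytes of the response
    with open(fname, 'wb') as f:
        f.write(page.content)
    # Pause 1-3 seconds between requests to avoid overloading the server
    time.sleep(1 + random.random() * 2)
```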

## HTML file structure

The Hypertext Markup Language (HTML) is the language used to create documents for the World Wide Web. You can use [W3Schools](https://www.w3schools.com/html/default.asp) to refresh your memory.

An HTML element is everything from the start tag to the end tag: <br />

`<opening tag> content...</closing tag>`

### HTML Elements
#### Heading
Heading elements are used for section headings.

```
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<h4>Heading 4</h4>
<h5>Heading 5</h5>
<h6>Heading 6</h6>
```

#### Paragraph
Paragraph elements are used for standard blocks of text.

```
<p>This is just a block of text.</p>
```

#### Span
Span elements are used to group text within another block of text, often for styling.

```
<p>This block of text has a <span>element</span> inside it.</p>
```

#### Image
Image elements are used to embed images in a web page.
```
<img src="image-file.jpg" alt="text that describes the image" />
```

### Trees represent hierarchical data

Web developers use trees to represent the data that makes up websites. Elements belong to each other or are descended from one another.


![Tree](./data/images/trees.png)

We can create a tree structure in HTML by putting elements inside other elements. To do this we often use a `<div>` element as a container. `<div>` elements are used to group chunks of content together.

```
<div>
    <h1>This is a heading.</h1>
    <p>This is a paragraph.</p>
</div>
```

![HTML&Tree](./data/images/HTML&Tree.png)
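
The same parent and child relationships can be inspected from Python. The sketch below parses the `<div>` example with Beautiful Soup and lists the direct children of the container; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = "<div><h1>This is a heading.</h1><p>This is a paragraph.</p></div>"
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div")

# Direct child tags of the <div>, in document order
child_names = [child.name for child in div.find_all(recursive=False)]

# Every tag also knows which element contains it
parent_name = soup.find("p").parent.name

print(child_names, parent_name)
```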


## Using Beautiful Soup

#### Use the find and select methods
`find()` is one of the most popular Beautiful Soup methods. It is similar to the find feature in a text editor and returns the first tag that matches. <br/>
`select_one()` returns the first tag that matches a CSS selector. <br />
To find elements by class, you can use:
* `{'class': 'class_name'}`
* `class_='class_name'`
* `.class_name`
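
The three class lookups listed above can be compared side by side. This is a small sketch on a made-up snippet; the class names `listing` and `price` are hypothetical and do not come from the TampereBNB pages.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two elements sharing one class
html = (
    '<div class="listing">'
    '<p class="price">100 per night</p>'
    '<p class="price">120 per night</p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

by_attrs = soup.find("p", {"class": "price"})   # attribute dictionary
by_keyword = soup.find("p", class_="price")     # class_ keyword argument
by_selector = soup.select_one(".price")         # CSS selector

# select() returns every match as a list instead of only the first one
all_prices = soup.select(".listing > .price")

print(by_attrs.text, by_keyword.text, by_selector.text, len(all_prices))
```

All three single-result calls return the first matching `<p>`, while `select()` is the one to reach for when every match is needed.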

#### Example:
The `find` method takes the name of the tag to look for:
```
soup.find('tag_name')
```

When we apply this to get the title of our webpage:
```
soup.find('title')
```

We get the title element of the webpage with its opening and closing tags:

```
<title>Tampere BNB</title>
```

To get the webpage title only, we can use `.text`:
```
soup.find('title').text
```

which gives us:
```
'Tampere BNB'
```
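
Real pages often pad text with whitespace, and a tag may be missing altogether. The following sketch, on a hypothetical snippet, contrasts `.text` with `get_text(strip=True)` and shows that `find()` returns `None` when nothing matches, so calling `.text` on the result would raise an `AttributeError`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with surrounding whitespace inside the tag
html = "<p>\n   Two rooms  \n</p>"
soup = BeautifulSoup(html, "html.parser")

raw = soup.find("p").text                    # keeps the whitespace
clean = soup.find("p").get_text(strip=True)  # strips leading/trailing whitespace

# There is no <h1> in this snippet, so find() returns None
missing = soup.find("h1")

print(repr(raw), repr(clean), missing)
```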


## Installing packages

Using conda:
```shell
conda install <package name>
```

Using pip:
```shell
pip install <package name>
```

Using pip within a Jupyter Notebook:
```python
!pip install <package name>
```

# Gather

In [532]:
# Load packages
import os # Interact with the operating system (paths, working directory)
import glob # Unix-style pathname pattern expansion

import re # Regular expressions for matching text patterns
from bs4 import BeautifulSoup # Pull data out of HTML and XML files

import pandas as pd # Data structures and tools for data analysis

In [533]:
# Check current directory
path = os.getcwd()
print("The current working directory is %s" % path)

The current working directory is /Users/roope/repos/joda2024/week_04/5.2-scraper


In [534]:
# Reading files
indices = glob.glob('./data/accom/*.html')
len(indices)

1080

In [535]:
# Collect the file names (without directories) of the local HTML files
local_files = list()
for fname in glob.glob('./data/accom/*.html'):
	local_files.append(os.path.basename(fname))
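
`glob.glob` makes no promise about the order in which it returns paths, so two runs on different machines may produce the rows of the final data frame in a different order. If a reproducible order matters, one option is to sort the paths first. This is a sketch only; the names `html_paths` and `file_names` are illustrative.

```python
import glob
import os

# Sort the matched paths so that every run processes files in the same order
html_paths = sorted(glob.glob(os.path.join("data", "accom", "*.html")))

# Strip the directory part in a way that works on any operating system
file_names = [os.path.basename(path) for path in html_paths]

print(len(file_names))
```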

In [536]:
# Please use the following list as the column names
column = ["Region", "Price", "Type", "Size", "Floor", "Construction", "Condition", "Elevator", "Longitude", "Latitude"]

In [537]:
# Creating dictionary of lists with pre-specified columns/keys
accom = dict()
for col in column:
	accom[col] = list()
print(accom)

{'Region': [], 'Price': [], 'Type': [], 'Size': [], 'Floor': [], 'Construction': [], 'Condition': [], 'Elevator': [], 'Longitude': [], 'Latitude': []}


In [538]:
for fname in indices:
	with open(fname) as f:
		content = f.read()
	soup = BeautifulSoup(content, 'html.parser')
	page_content = soup.find('div', {'class': 'container'})
	
	# Get the price
	price = page_content.select_one('.basic-info .profile_price').text
	price = price.split(" ")[0].strip()
	accom["Price"].append(price)

	# Get the region
	region = soup.select('.map > h3')[1].get_text(strip=True)
	accom["Region"].append(region)

	# Get the apartment type
	apartment_type = soup.select('.about_space_type > p')[0].get_text(strip=True)
	accom["Type"].append(apartment_type)

	# Get the apartment size
	apartment_size = soup.select('.about_space_area > p')[0].get_text(strip=True)
	apartment_size = apartment_size.split(" ")[0].strip()
	accom["Size"].append(apartment_size)

	# Get the apartment floor
	apartment_floor = soup.select('.about_space_floor > p')[0].get_text(strip=True)
	apartment_floor = apartment_floor.split(" ")[4].strip()
	accom["Floor"].append(apartment_floor)

	# Get the apartment construction year
	construction_year = soup.select('.about_space_year > p')[0].get_text(strip=True)
	construction_year = construction_year.split(" ")[7].strip()
	accom["Construction"].append(construction_year)

	# Get the apartment status
	apartment_status = soup.select('.about_space_condition > p')[0].get_text(strip=True)
	apartment_status = apartment_status.split(" ")[13].strip()
	accom["Condition"].append(apartment_status)

	# Get the availability of the elevator
	elevator_availability = soup.select('.offers > tr > td')[0].get_text(strip=True)
	
	if (len(elevator_availability) == 9):
		accom["Elevator"].append("Yes")
	else:
		accom["Elevator"].append("No")

	# Get the longitude
	longitude = soup.select('.map > p')[0].get_text(strip=True)
	longitude = longitude.split(" ")[11].strip()
	accom["Longitude"].append(longitude)

	# Get the latitude
	latitude = soup.select('.map > p')[0].get_text(strip=True)
	latitude = latitude.split(" ")[15].strip()
	latitudeA = latitude.split(".")[0].strip()
	latitude = latitudeA + "." + latitude.split(".")[1].strip()
	accom["Latitude"].append(latitude)
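
The positional lookups in the loop, such as `split(" ")[4]`, depend on the exact number of spaces in the page text. A regular expression is one possible alternative that is less sensitive to spacing. The sketch below uses a hypothetical helper, `first_number`, on made-up strings; the real TampereBNB text may be formatted differently, so the pattern would need checking against the actual files.

```python
import re

def first_number(text):
    """Return the first integer or decimal number in text as a string, or None."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return match.group(0) if match else None

# Hypothetical strings, not copied from the TampereBNB pages
floor = first_number("Floor:     -1")
size = first_number("Size: 57.5 m2")
nothing = first_number("No digits here")

print(floor, size, nothing)
```

The optional leading minus sign matters here because basement floors appear as negative numbers in the data.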
	


In [539]:
# Constructing a DataFrame from a dict
df = pd.DataFrame.from_dict(accom)

# Assess

Assessing your data is the second step in data wrangling. When assessing, you're like a detective at work, inspecting your dataset for two things: data quality issues (i.e. content issues) and lack of tidiness (i.e. structural issues).
- Quality: issues with content. Low quality data is also known as dirty data.
- Tidiness: issues with structure that prevent easy analysis. Untidy data is also known as messy data. Tidy data requirements:
    - Each variable forms a column.
    - Each observation forms a row.
    - Each type of observational unit forms a table.

<div class="alert alert-block alert-info">
<b>Tip:</b> This part is not graded and the code is provided; you can just run the cells and assess your dataset.
</div>


In [540]:
# Get the first five rows
df.head()

| | Region | Price | Type | Size | Floor | Construction | Condition | Elevator | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tesoma, Tampere | €180 | Three rooms | 74.0 | 7 | 1966 | unknown | No | 23.61033840613356 | 61.51083191092233 |
| 1 | Takahuhti, Tampere | €243 | Two rooms | 57.5 | 4 | 1972 | good | No | 23.860745947641902 | 61.49763715174763 |
| 2 | Hervanta, Tampere | €177 | Studio apartment | 27.0 | 0 | 2019 | good | No | 23.84882877070141 | 61.45446021008072 |
| 3 | Niemenranta, Tampere | €153 | Studio apartment | 24.5 | 1 | 2017 | good | No | 23.690050674048194 | 61.51717521456669 |
| 4 | Tammela, Tampere | €363 | Three rooms | 73.0 | 4 | 1971 | good | No | 23.77786391016316 | 61.50136263215722 |


In [541]:
# Get the last five rows
df.tail()

| | Region | Price | Type | Size | Floor | Construction | Condition | Elevator | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 1075 | Rautaharkko, Tampere | €168 | Two rooms | 42.0 | 0 | 1955 | unknown | No | 23.768104537978544 | 61.472621213004935 |
| 1076 | Vuores, Tampere | €399 | Four rooms or more | 95.0 | -1 | 2012 | good | No | 23.808197301985487 | 61.43451271139134 |
| 1077 | Hämeenpuisto, Tampere | €222 | Studio apartment | 28.0 | 5 | 1908 | good | No | 23.75151211981315 | 61.49771017598682 |
| 1078 | Hakametsä, Tampere | €351 | Three rooms | 63.0 | 5 | 2012 | good | No | 23.829752825287773 | 61.48832464659127 |
| 1079 | Ikuri, Tampere | €366 | Four rooms or more | 141.0 | 0 | 1974 | unknown | No | 23.59902615472741 | 61.51101167244906 |


In [542]:
# Getting a summary of the data set
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Region        1080 non-null   object
 1   Price         1080 non-null   object
 2   Type          1080 non-null   object
 3   Size          1080 non-null   object
 4   Floor         1080 non-null   object
 5   Construction  1080 non-null   object
 6   Condition     1080 non-null   object
 7   Elevator      1080 non-null   object
 8   Longitude     1080 non-null   object
 9   Latitude      1080 non-null   object
dtypes: object(10)
memory usage: 84.5+ KB


In [543]:
# Check for null values in the Price feature
df[df['Price'].isnull()]

| | Region | Price | Type | Size | Floor | Construction | Condition | Elevator | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|---|


In [544]:
# Check for duplicated rows
df[df.duplicated()]

| | Region | Price | Type | Size | Floor | Construction | Condition | Elevator | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|---|


In [545]:
# Get summary statistics of the Construction feature
df[["Construction"]].describe()

| | Construction |
|---|---|
| count | 1080 |
| unique | 99 |
| top | 2022 |
| freq | 92 |


In [546]:
# Count how often each unique value occurs in the Type feature
df["Type"].value_counts()

Type
Two rooms             463
Three rooms           262
Studio apartment      204
Four rooms or more    151
Name: count, dtype: int64

# Clean

In [547]:
# Make a copy of the DataFrame's indices and data
df_clean = df.copy()

In [548]:
# Change the data format

# Change the data format of Floor feature
df_clean["Floor"] = pd.to_numeric(df_clean["Floor"])

# Change the data format of Size feature
df_clean["Size"] = pd.to_numeric(df_clean["Size"])

# Change the data format of Longitude feature
df_clean["Longitude"] = pd.to_numeric(df_clean["Longitude"])

# Change the data format of Latitude feature
df_clean["Latitude"] = pd.to_numeric(df_clean["Latitude"])

# Change the data format of Construction feature
df_clean["Construction"] = pd.to_numeric(df_clean["Construction"])
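
`pd.to_numeric` raises an error as soon as it meets a value it cannot parse. If the scraped strings turn out to contain such values, passing `errors="coerce"` is one way to locate them before deciding how to handle them. The sketch below uses a made-up series; `"n/a"` is a hypothetical placeholder, not a value known to occur in the TampereBNB data.

```python
import pandas as pd

# Hypothetical scraped values; "n/a" stands in for an unparseable entry
raw = pd.Series(["1966", "2019", "n/a"])

# Unparseable strings become NaN instead of raising an error
years = pd.to_numeric(raw, errors="coerce")

# Look up the original strings that failed to convert
bad_rows = raw[years.isna()]

# The nullable Int64 dtype keeps integers as integers despite the gap
years_int = years.astype("Int64")

print(bad_rows.tolist(), years_int.dtype)
```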

In [549]:
# Check changes
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Region        1080 non-null   object 
 1   Price         1080 non-null   object 
 2   Type          1080 non-null   object 
 3   Size          1080 non-null   float64
 4   Floor         1080 non-null   int64  
 5   Construction  1080 non-null   int64  
 6   Condition     1080 non-null   object 
 7   Elevator      1080 non-null   object 
 8   Longitude     1080 non-null   float64
 9   Latitude      1080 non-null   float64
dtypes: float64(3), int64(2), object(5)
memory usage: 84.5+ KB


In [550]:
# Save and load latest changes
df_clean.to_pickle("TampereBNB.pkl")
unpickled_df = pd.read_pickle("TampereBNB.pkl")
unpickled_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Region        1080 non-null   object 
 1   Price         1080 non-null   object 
 2   Type          1080 non-null   object 
 3   Size          1080 non-null   float64
 4   Floor         1080 non-null   int64  
 5   Construction  1080 non-null   int64  
 6   Condition     1080 non-null   object 
 7   Elevator      1080 non-null   object 
 8   Longitude     1080 non-null   float64
 9   Latitude      1080 non-null   float64
dtypes: float64(3), int64(2), object(5)
memory usage: 84.5+ KB
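
Comparing the `info()` output by eye works, but the round trip can also be checked programmatically. The sketch below does this on a small stand-in data frame written to a temporary directory, so it does not touch `TampereBNB.pkl`; the names `df_example` and `restored` are illustrative.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the cleaned data frame
df_example = pd.DataFrame(
    {"Region": ["Hervanta, Tampere"], "Size": [27.0], "Floor": [0]}
)

# Write to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.pkl")
    df_example.to_pickle(path)
    restored = pd.read_pickle(path)

same_content = restored.equals(df_example)                  # values and dtypes match
same_dtypes = (restored.dtypes == df_example.dtypes).all()  # column types preserved

print(same_content, same_dtypes)
```

Unlike a CSV file, a pickle preserves column dtypes, which is why the numeric conversions above survive the save and reload.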
