# Instructions

As a data scientist, you've found a data treasure on TampereBNB - the top online platform for short-term housing rentals in the city of Tampere. From thousands of listings, you can extract valuable insights to inform decision-making, marketing, and research. Your goal is to scrape and clean this data to enable further analysis.

## Accessing the HTML dataset
The quickest way to get HTML data is to save a page's HTML file to your computer manually. A web page's HTML, however, changes over time, and scraping code breaks easily when a site is redesigned, which makes scraping brittle and a poor fit for long-lived projects. To ease access and limit traffic to the site, we have already scraped the TampereBNB website and stored the pages in the `data/accom` folder. Please use these provided HTML files and pretend that you saved them yourself. I recommend opening the HTML files in your preferred text editor to inspect how the website structures its information.

## TODO

- Students are expected to extract the following information:
    - Region
    - Price of the accommodation
    - Apartment type
    - Square meters m2
    - Apartment floor
    - Construction year
    - Apartment status
    - The availability of an elevator
    - Longitude
    - Latitude
- Change the data format, if necessary, of:
    - Floor
    - Size
    - Construction
    - Longitude
    - Latitude
- Save the data frame in a pickle file

Screenshot of how the dataframe is expected to look:

![df_screenshot](./data/images/df_screenshot.png)

## Notes

<div class="alert alert-block alert-danger">
<b>Do not:</b> change the Jupyter Notebook file's name.
</div>

<div class="alert alert-block alert-danger">
<b>Use:</b> the given list as column names.
</div>

<div class="alert alert-block alert-danger">
<b>Save:</b> the dataframe as a pickle file with the name "TampereBNB.pkl".
</div>

<div class="alert alert-block alert-info">
<b>Tip:</b> Please keep in mind that although this template is designed in a linear style, generally, data wrangling is an iterative process.
</div>


# Reminder

This section is devoted to refreshing your memory on prerequisites to complete the assignment. 

## How does web scraping work? 
Website data is written in HTML (HyperText Markup Language), which uses tags to structure the page. Because HTML and its tags are just text, that text can be read and navigated with a parser. We'll be using a Python parser called [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/).
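
As a minimal sketch of what a parser does, the snippet below turns a small HTML string into a searchable object and reads text out of it. The HTML string is a made-up example and is not taken from the TampereBNB pages.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration
html = (
    "<html><head><title>Example page</title></head>"
    "<body><p>Hello, <span>world</span>!</p></body></html>"
)

# 'html.parser' is the parser that ships with Python's standard library
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.get_text()          # text inside <title>
paragraph_text = soup.find("p").get_text()  # text of <p>, including nested tags

print(title_text)
print(paragraph_text)
```

Note that `get_text()` on the paragraph also includes the text of the nested `<span>` element.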


The following script is used to download HTML files programmatically:
```python
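import random
import time

import requests  # third-party package: pip install requests

for listing_index in range(0, 1080):
    # listing_index is an int, so format it into the string instead of concatenating
    print("getting the %d url:" % listing_index)
    url = 'https://joda-tuni.azurewebsites.net/accom/M20%1d23' % listing_index
    print(url)
    page = requests.get(url)
    fname = 'data/accom/M20%1d23.html' % listing_index
    print(fname)
    # save the raw bytes of the response as a local HTML file
    with open(fname, 'wb') as f:
        f.write(page.content)
    # wait 1 to 3 seconds between requests to limit the load on the server
    time.sleep(1 + random.random() * 2)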
```
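
To see how the script above builds its URLs and file names without sending any requests, the sketch below applies the same `%` formatting on its own. The helper name `listing_paths` is hypothetical; the two patterns are copied from the script.

```python
# Patterns copied from the download script; %1d inserts the integer
# with a minimum width of one character
URL_PATTERN = "https://joda-tuni.azurewebsites.net/accom/M20%1d23"
FILE_PATTERN = "data/accom/M20%1d23.html"


def listing_paths(listing_index):
    """Return the (url, local file name) pair for one listing index."""
    return URL_PATTERN % listing_index, FILE_PATTERN % listing_index


url, fname = listing_paths(7)
print(url)
print(fname)
```

Because the index is placed between the fixed `M20` prefix and `23` suffix, index 7 maps to `M20723` and index 1079 maps to `M20107923`.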

## HTML file structure

The Hypertext Markup Language (or HTML) is the language used to create documents for the World Wide Web. You can use [W3Schools](https://www.w3schools.com/html/default.asp) to refresh your memory.

An HTML element is everything from the start tag to the end tag: <br />

`<opening tag> content...</closing tag>`

### HTML Elements
#### Heading
Heading elements are used for section headings.

```
<h1>Heading 1</h1>
<h2>Heading 2</h2>
<h3>Heading 3</h3>
<h4>Heading 4</h4>
<h5>Heading 5</h5>
<h6>Heading 6</h6>
```

#### Paragraph
Paragraph elements are used for standard blocks of text.

```
<p>This is just a block of text.</p>
```

#### Span
Span elements are used to group text within another block of text, often for styling.
```
<p>This block of text has a <span>element</span> inside it.</p>

```
#### Image
Image elements are used to embed images in a web page.
```
<img src="image-file.jpg" alt="text that describes the image" />
```

### Trees represent hierarchical data

Web developers use trees to represent the data that makes up websites. Elements belong to each other or are descended from one another.


![Tree](./data/images/trees.png)

We can create a tree structure in HTML by putting elements inside other elements. To do this we often use a `<div>` element as a container. `<div>` elements are used to group chunks of content together.

```
<div>
    <h1>This is a heading.</h1>
    <p>This is a paragraph.</p>
</div>
```

![HTML&Tree](./data/images/HTML&Tree.png)
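
The nesting can also be made visible in code. The sketch below uses Python's built-in `html.parser` module (not Beautiful Soup) to record each opening tag together with its depth in the tree. The class name `TreePrinter` is hypothetical, and the sketch assumes every opening tag has a matching closing tag.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collect each opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Two spaces of indentation per level of nesting
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Leaving an element moves one level back up the tree
        self.depth -= 1


printer = TreePrinter()
printer.feed("<div><h1>This is a heading.</h1><p>This is a paragraph.</p></div>")
print("\n".join(printer.lines))
```

The `<h1>` and `<p>` elements end up at the same depth because both are children of the same `<div>`.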


## Using Beautiful Soup

#### Use the find and select methods
`find()` is one of the most popular Beautiful Soup methods. It is similar to the find feature in a text editor and returns the first matching tag. <br/>
`select_one()` finds only the first tag that matches a CSS selector. <br />
To find classes you can use:
* `{'class': 'class_name'}` as the second argument of `find()`
* `class_='class_name'` as a keyword argument of `find()`
* `.class_name` as a CSS selector in `select_one()`
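
The three notations can be compared side by side. This is a small sketch on a made-up HTML string; the class names `card`, `price`, and `note` are hypothetical and do not come from the TampereBNB pages.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration
html = (
    '<div class="card">'
    '<p class="price">€100 / night</p>'
    '<p class="note">breakfast included</p>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find("p", {"class": "price"})   # attribute dictionary
by_keyword = soup.find("p", class_="price")    # class_ keyword argument
by_selector = soup.select_one("p.price")       # CSS selector

print(by_dict.text, by_keyword.text, by_selector.text, sep="\n")
```

All three calls locate the same `<p>` element, so which one to use is mostly a matter of readability.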

#### Example:
The general form of the `find` method takes the name of the tag to look for as a string:
```
soup.find('text_string')
```

When we apply this to get the title of our webpage:
```
soup.find('title')
```

We get the title element of the webpage with its opening and closing tags:

```
<title>Tampere BNB</title>

```

To get the webpage title only, we can use `.text`:
```
soup.find('title').text
```

which gives us:
```
'Tampere BNB'
```


## Installing packages

Using conda:
```
conda install <package name>
```

Using pip:
```
pip install <package name>
```

Using pip within Jupyter Notebook:
```
!pip install <package name>
```
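
Before installing anything, it can help to check which packages are already available. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the import system."""
    return importlib.util.find_spec(module_name) is not None


# bs4 and pandas are the two third-party packages used in this notebook
for name in ["bs4", "pandas"]:
    status = "installed" if is_installed(name) else "missing"
    print(name, status)
```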

# Gather

In [475]:
# load packages
import os  # interact with the operating system (paths, working directory)
import glob  # find files with Unix-style pathname patterns

import re  # check whether a given string matches a given pattern
from bs4 import BeautifulSoup  # pull data out of HTML and XML files

import pandas as pd  # tabular data structures and data analysis

In [476]:
# check current directory
path = os.getcwd()
print("The current working directory is %s" % path)

The current working directory is /Users/roope/repos/joda2024/week_04/5.2-scraper


In [477]:
#reading files
indices = glob.glob('./data/accom/*.html')
len(indices)

1080

In [478]:
# collect the file names (without the folder path) of the local HTML files
local_files = list()
for fname in glob.glob('./data/accom/*.html'):
	local_files.append(os.path.basename(fname))
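
One detail worth knowing is that `glob.glob` does not guarantee any particular order, so the row order of the final dataframe can differ between machines. A possible way to make it reproducible is to sort the paths, sketched below on a temporary folder with made-up file names.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a few hypothetical files to search through
    for name in ["M20223.html", "M20023.html", "M20123.html", "notes.txt"]:
        with open(os.path.join(folder, name), "w") as f:
            f.write("<html></html>")

    # sorted() gives a stable, alphabetical order; the pattern skips notes.txt
    paths = sorted(glob.glob(os.path.join(folder, "*.html")))
    file_names = [os.path.basename(p) for p in paths]

print(file_names)
```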

In [479]:
# Please use the following list as column names
column = ["Region", "Price", "Type", "Size", "Floor", "Construction", "Condition", "Elevator", "Longitude", "Latitude"]

In [480]:
#creating dictionary of lists with pre-specified columns/keys
accom = dict()
for col in column:
	accom[col] = list()
print(accom)

{'Region': [], 'Price': [], 'Type': [], 'Size': [], 'Floor': [], 'Construction': [], 'Condition': [], 'Elevator': [], 'Longitude': [], 'Latitude': []}
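
The same dictionary of empty lists can be written more compactly with a dict comprehension. The sketch below, using a shortened column list, also shows why `dict.fromkeys(column, [])` is not an equivalent shortcut: it reuses one single list for every key.

```python
column = ["Region", "Price", "Type"]  # shortened list for illustration

# One new, independent list per key
accom = {col: [] for col in column}

# Pitfall: every key points to the SAME list object
shared = dict.fromkeys(column, [])

accom["Price"].append("€180")
shared["Price"].append("€180")

print(accom)
print(shared)
```

After the two appends, only `accom` keeps the other columns empty; in `shared` the value shows up under every key.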


In [481]:
for fname in indices:
	with open(fname) as f:
		content = f.read()
	soup = BeautifulSoup(content, 'html.parser')
	page_content = soup.find('div', {'class': 'container'})
	#getting the price
	price = page_content.select_one('.basic-info .profile_price').text
	price = price.split(" ")[0].strip()
	accom["Price"].append(price)
	# TODO: get all info on space
	# space = page_content.select_one('.about_space_basic').text.strip()
	# accom["Type"].append(space)

	# TODO: get the region
	region = soup.select('.map > h3')[1].get_text(strip=True)
	accom["Region"].append(region)

	# TODO: get the apartment type
	apartment_type = soup.select('.about_space_type > p')[0].get_text(strip=True)
	accom["Type"].append(apartment_type)

	# TODO: get the apartment size
	apartment_size = soup.select('.about_space_area > p')[0].get_text(strip=True)
	apartment_size = apartment_size.split(" ")[0].strip()
	accom["Size"].append(apartment_size)

	# TODO: get the apartment floor
	apartment_floor = soup.select('.about_space_floor > p')[0].get_text(strip=True)
	apartment_floor = apartment_floor.split(" ")[4].strip()
	accom["Floor"].append(apartment_floor)

	# TODO: get the apartment construction year
	construction_year = soup.select('.about_space_year > p')[0].get_text(strip=True)
	construction_year = construction_year.split(" ")[7].strip()
	accom["Construction"].append(construction_year)

	# TODO: get the apartment status
	apartment_status = soup.select('.about_space_condition > p')[0].get_text(strip=True)
	apartment_status = apartment_status.split(" ")[13].strip()
	accom["Condition"].append(apartment_status)

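	# TODO: get the availability of the elevator
	elevator_availability = soup.select('.offers > tr > td')[0].get_text(strip=True)

	# the first cell of the offers table counts as "elevator available"
	# only when its text is exactly 9 characters long
	if len(elevator_availability) == 9:
		accom["Elevator"].append("Yes")
	else:
		accom["Elevator"].append("No")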

	# TODO: get the longitude
	longitude = soup.select('.map > p')[0].get_text(strip=True)
	longitude = longitude.split(" ")[11].strip()
	accom["Longitude"].append(longitude)
	print(longitude)

	# TODO: get the latitude
	latitude = soup.select('.map > p')[0].get_text(strip=True)
	latitude = latitude.split(" ")[15].strip()
	# keep only the integer and decimal parts, dropping anything after a second "."
	latitudeA = latitude.split(".")[0].strip()
	latitude = latitudeA + "." + latitude.split(".")[1].strip()
	accom["Latitude"].append(latitude)
	


23.61033840613356
61.51083191092233
23.860745947641902
61.49763715174763
23.84882877070141
61.45446021008072

In [482]:
#constructing a DataFrame from a dict
df = pd.DataFrame.from_dict(accom)
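
The positional lookups in the extraction loop above, such as `split(" ")[7]`, depend on the exact number of words in each sentence. A regular expression that searches for the number itself is one possible, less position-dependent alternative. The sketch below is not the approach used in this notebook; the helper names and both sample strings are hypothetical and only imitate the kind of text found on a listing page.

```python
import re

# An optional minus sign, digits, and an optional decimal part
NUMBER = r"-?\d+(?:\.\d+)?"


def first_number(text):
    """Return the first number found in text as a string, or None."""
    match = re.search(NUMBER, text)
    return match.group(0) if match else None


def all_numbers(text):
    """Return every number found in text as a list of strings."""
    return re.findall(NUMBER, text)


# Hypothetical strings, not copied from the real HTML files
price_text = "€180 per night"
map_text = "Located at longitude 23.61033 and latitude 61.51083."

print(first_number(price_text))
print(all_numbers(map_text))
```

The sentence-final period after the latitude is not part of the match, because the pattern only accepts a dot that is followed by digits.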

# Assess

Assessing your data is the second step in data wrangling. When assessing, you're like a detective at work, inspecting your dataset for two things: data quality issues (i.e. content issues) and lack of tidiness (i.e. structural issues).
- Quality: issues with content. Low quality data is also known as dirty data.
- Tidiness: issues with structure that prevent easy analysis. Untidy data is also known as messy data. Tidy data requirements:
    - Each variable forms a column.
    - Each observation forms a row.
    - Each type of observational unit forms a table.

<div class="alert alert-block alert-info">
<b>Tip:</b> This part is not graded and the code is provided; you can just run the cells and assess your dataset.
</div>


In [483]:
# getting the first five rows
df.head()

| | Region | Price | Type | Size | Floor | Construction | Condition | Elevator | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Tesoma, Tampere | €180 | Three rooms | 74.0 | 7 | 1966 | unknown | No | 23.61033840613356 | 61.51083191092233 |
| 1 | Takahuhti, Tampere | €243 | Two rooms | 57.5 | 4 | 1972 | good | No | 23.860745947641902 | 61.49763715174763 |
| 2 | Hervanta, Tampere | €177 | Studio apartment | 27.0 | 0 | 2019 | good | No | 23.84882877070141 | 61.45446021008072 |
| 3 | Niemenranta, Tampere | €153 | Studio apartment | 24.5 | 1 | 2017 | good | No | 23.690050674048194 | 61.51717521456669 |
| 4 | Tammela, Tampere | €363 | Three rooms | 73.0 | 4 | 1971 | good | No | 23.77786391016316 | 61.50136263215722 |


In [484]:
# getting the last five rows
df.tail()

| | Region | Price | Type | Size | Floor | Construction | Condition | Elevator | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|---|
| 1075 | Rautaharkko, Tampere | €168 | Two rooms | 42.0 | 0 | 1955 | unknown | No | 23.768104537978544 | 61.472621213004935 |
| 1076 | Vuores, Tampere | €399 | Four rooms or more | 95.0 | -1 | 2012 | good | No | 23.808197301985487 | 61.43451271139134 |
| 1077 | Hämeenpuisto, Tampere | €222 | Studio apartment | 28.0 | 5 | 1908 | good | No | 23.75151211981315 | 61.49771017598682 |
| 1078 | Hakametsä, Tampere | €351 | Three rooms | 63.0 | 5 | 2012 | good | No | 23.829752825287773 | 61.48832464659127 |
| 1079 | Ikuri, Tampere | €366 | Four rooms or more | 141.0 | 0 | 1974 | unknown | No | 23.59902615472741 | 61.51101167244906 |


In [485]:
# getting a summary of the data set
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Region        1080 non-null   object
 1   Price         1080 non-null   object
 2   Type          1080 non-null   object
 3   Size          1080 non-null   object
 4   Floor         1080 non-null   object
 5   Construction  1080 non-null   object
 6   Condition     1080 non-null   object
 7   Elevator      1080 non-null   object
 8   Longitude     1080 non-null   object
 9   Latitude      1080 non-null   object
dtypes: object(10)
memory usage: 84.5+ KB


In [486]:
#check null values in Price feature
df[df['Price'].isnull()]

Unnamed: 0,Region,Price,Type,Size,Floor,Construction,Condition,Elevator,Longitude,Latitude


In [487]:
#check duplicated values
# df[df.duplicated()]

In [488]:
# getting summary statistics of the Construction feature
df[["Construction"]].describe()

| | Construction |
|---|---|
| count | 1080 |
| unique | 99 |
| top | 2022 |
| freq | 92 |


In [489]:
# counting how often each unique value occurs in the Type feature
df["Type"].value_counts()

Type
Two rooms             463
Three rooms           262
Studio apartment      204
Four rooms or more    151
Name: count, dtype: int64

# Clean

In [490]:
#Make a copy of this object’s indices and data.
df_clean = df.copy()

In [491]:
#change the data format

#changing the data format of floor feature
df_clean["Floor"] = pd.to_numeric(df_clean["Floor"])

# TODO: change the data format of Size feature
df_clean["Size"] = pd.to_numeric(df_clean["Size"])

# TODO: change the data format of Longitude feature
df_clean["Longitude"] = pd.to_numeric(df_clean["Longitude"])

# TODO: change the data format of Latitude feature
df_clean["Latitude"] = pd.to_numeric(df_clean["Latitude"])

# TODO: change the data format of Construction feature
df_clean["Construction"] = pd.to_numeric(df_clean["Construction"])
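
By default `pd.to_numeric` raises an error as soon as one value cannot be parsed. If a column might contain stray text, passing `errors="coerce"` turns such values into `NaN` so they can be counted and inspected. The sketch below uses a made-up series; the value `"unknown"` is a hypothetical dirty entry and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw values as they might come out of the scraper (all strings)
raw = pd.Series(["74.0", "57.5", "unknown", "27"])

# Unparseable entries become NaN instead of raising a ValueError
coerced = pd.to_numeric(raw, errors="coerce")

# Count how many values could not be converted
n_bad = int(coerced.isna().sum())

print(coerced)
print("values that could not be converted:", n_bad)
```

Counting the `NaN` values right after the conversion is a quick way to find out whether the extraction step needs another look.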

In [492]:
#check changes
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Region        1080 non-null   object 
 1   Price         1080 non-null   object 
 2   Type          1080 non-null   object 
 3   Size          1080 non-null   float64
 4   Floor         1080 non-null   int64  
 5   Construction  1080 non-null   int64  
 6   Condition     1080 non-null   object 
 7   Elevator      1080 non-null   object 
 8   Longitude     1080 non-null   float64
 9   Latitude      1080 non-null   float64
dtypes: float64(3), int64(2), object(5)
memory usage: 84.5+ KB


In [493]:
#save and load latest changes
df_clean.to_pickle("TampereBNB.pkl")
unpickled_df = pd.read_pickle("TampereBNB.pkl")
unpickled_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1080 entries, 0 to 1079
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Region        1080 non-null   object 
 1   Price         1080 non-null   object 
 2   Type          1080 non-null   object 
 3   Size          1080 non-null   float64
 4   Floor         1080 non-null   int64  
 5   Construction  1080 non-null   int64  
 6   Condition     1080 non-null   object 
 7   Elevator      1080 non-null   object 
 8   Longitude     1080 non-null   float64
 9   Latitude      1080 non-null   float64
dtypes: float64(3), int64(2), object(5)
memory usage: 84.5+ KB
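
Comparing the two `info()` outputs by eye works, but the round trip can also be checked programmatically with `DataFrame.equals`. The sketch below uses a small made-up dataframe and a temporary file, so it does not touch `TampereBNB.pkl`.

```python
import os
import tempfile

import pandas as pd

# Small hypothetical dataframe with mixed column types
df_example = pd.DataFrame(
    {
        "Region": ["Tesoma, Tampere", "Takahuhti, Tampere"],
        "Size": [74.0, 57.5],
        "Floor": [7, 4],
    }
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.pkl")
    df_example.to_pickle(path)       # save
    restored = pd.read_pickle(path)  # load back

# equals() compares values and dtypes of both dataframes
round_trip_ok = restored.equals(df_example)
print(round_trip_ok)
```

Unlike a CSV file, a pickle file keeps the column dtypes, so the numeric conversions done in the cleaning step survive saving and loading.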
