# Data analysis with web scraping

In this assignment I scrape daylight information from timeanddate.com. The dataset covers 2015 to 2020 for four Finnish cities: Helsinki, Jyväskylä, Rovaniemi, and Ivalo.

![Image of sun time](image.svg)

## Data collecting

In [1]:
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import glob
import os

import plotly.graph_objects as go
import chart_studio.plotly as py
import cufflinks as cf
from plotly.offline import iplot, init_notebook_mode
from matplotlib import pyplot as plt
import seaborn as sns
import plotly.express as px

init_notebook_mode(connected=True)
cf.go_offline(connected=True)
cf.set_config_file(theme="pearl")
pd.set_option('display.max_columns', 30)
%matplotlib inline

In [2]:
city_name = ['helsinki', "rovaniemi", 'ivalo', 'jyvaskyla']
# month = [m for m in range(1,13)]
# year = [y for y in range(2015,2021)]

# for city in city_name:
#     dataset = []  # start a fresh list of rows for each city
#     for y in year:
#         for m in month:
#             url = 'https://www.timeanddate.com/sun/finland/'+city+'?month='+str(m)+'&year='+str(y)
#             with requests.get(url) as response:
#                 soup = BeautifulSoup(response.text, "html.parser")
#                 day = soup.findAll("tr", {"title": "Click to expand for more details"})
#                 for d in day:
#                     final_date = str(y)+'-'+str(m)+'-'+str(d['data-day'])
#                     ele = d.findAll("td")
#                     sun = ele[0].find(text=True, recursive=False).strip()
#                     if len(sun.split("."))==1:
#                         sunrise_time = sun
#                         sunset_time = sun
#                         day_length = 0
#                         dif_ = ele[1].find(text=True, recursive=False)
#                         if dif_ == None:
#                             day_diff = 0
#                         else:
#                             day_diff = dif_.strip()
#                     else:
#                         sunrise_time = ele[0].find(text=True, recursive=False).strip().replace(".",":")
#                         sunset_time = ele[1].find(text=True, recursive=False).strip().replace(".",":")
#                         day_length = ele[2].find(text=True, recursive=False).strip()
#                         dif_ = ele[3].find(text=True, recursive=False)
#                         if dif_ == None:
#                             day_diff = 0
#                         else:
#                             day_diff = dif_.strip()
#                     twilight_start = ele[-4].find(text=True, recursive=False).strip().replace(".",":")
#                     twilight_end = ele[-3].find(text=True, recursive=False).strip().replace(".",":")
#                     solar_noon = ele[-2].find(text=True, recursive=False).strip().replace(".",":")
                    
#                     dataset.append([city,final_date,sunrise_time,sunset_time, day_length,day_diff, twilight_start, twilight_end, solar_noon])
    
#     df = pd.DataFrame(dataset,columns=['City','Date','Sunrise','Sunset','Day length', 'Day length diff.',
#                                       'twilight_start', 'twilight_end', 'solar noon'])
#     df.to_csv(f"{city}.csv")

Since the amount of data is limited, I saved each city's data to its own CSV file for later use.
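The request URLs used by the scraper follow a simple pattern of city, month, and year. The sketch below only builds that list of URLs and sends no requests. The helper `build_url` and the constant `BASE_URL` are names introduced here for illustration; they are not part of the original notebook.

```python
from itertools import product

BASE_URL = "https://www.timeanddate.com/sun/finland/"  # same base as in the scraper above
cities = ["helsinki", "rovaniemi", "ivalo", "jyvaskyla"]
years = range(2015, 2021)   # 2015..2020 inclusive
months = range(1, 13)       # 1..12


def build_url(city, year, month):
    """Hypothetical helper: same query-string layout as the commented-out loop."""
    return f"{BASE_URL}{city}?month={month}&year={year}"


# One URL per (city, year, month) combination
urls = [build_url(c, y, m) for c, y, m in product(cities, years, months)]

print(len(urls))  # 4 cities x 6 years x 12 months
print(urls[0])
```

Building the list first makes it easy to check how many pages will be requested before any scraping starts.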

## Data description

In [3]:
file_list = glob.glob(os.path.join("data", "*.csv"))

In [4]:
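frames = []
for file in file_list:
    df = pd.read_csv(file, index_col=0)
    # Parse dates explicitly (date_parser is deprecated in recent pandas); bad values become NaT
    df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
    frames.append(df)
# Concatenate once instead of growing the DataFrame inside the loop
data = pd.concat(frames, ignore_index=True)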

The raw data looks like this:

In [5]:
data.head()

Unnamed: 0,City,Date,Sunrise,Sunset,Day length,Day length diff.,twilight_start,twilight_end,solar noon
0,helsinki,2015-01-01,9:24,15:23,5:59:09,+1:48,8:26,16:20,12:23
1,helsinki,2015-01-02,9:23,15:24,6:01:07,+1:58,8:26,16:21,12:24
2,helsinki,2015-01-03,9:23,15:26,6:03:15,+2:08,8:26,16:23,12:24
3,helsinki,2015-01-04,9:22,15:27,6:05:33,+2:17,8:25,16:24,12:25
4,helsinki,2015-01-05,9:21,15:29,6:08:00,+2:26,8:25,16:25,12:25


In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8768 entries, 0 to 8767
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   City              8768 non-null   object        
 1   Date              8768 non-null   datetime64[ns]
 2   Sunrise           8768 non-null   object        
 3   Sunset            8768 non-null   object        
 4   Day length        8768 non-null   object        
 5   Day length diff.  8768 non-null   object        
 6   twilight_start    8768 non-null   object        
 7   twilight_end      8768 non-null   object        
 8   solar noon        8768 non-null   object        
dtypes: datetime64[ns](1), object(8)
memory usage: 616.6+ KB


The dataset includes the following columns:
- City: the city name.
- Date: the date of the record.
- Sunrise: the time the sun rises. If no time is recorded because the sun doesn't rise that day, the value is "Down all day".
- Sunset: the time the sun sets. If no time is recorded because the sun doesn't set that day, the value is "Up all day".
- Day length: the duration of daylight. It is set to 24 hours when Sunrise is "Up all day" and to 0 when Sunrise is "Down all day".
- Day length diff.: the change in day length compared to the previous day. The raw value is a signed duration such as "+1:48"; after cleaning it is encoded as:
    - 1 if the day is longer than the previous one
    - 0 otherwise
- twilight_start: the time twilight starts. If there is no record of sunrise/sunset, this value is "Rest of night".
- twilight_end: the time twilight ends. If there is no record of sunrise/sunset, this value is "Rest of night".
- solar noon: the time the sun is at its highest point in the day.

Except for the Date column, which is parsed into the right type on reading, all columns are strings. I therefore convert the time columns to the timedelta type for later analysis.

In [7]:
def int_to_timedelta(col):
    # Convert an "H:MM" string such as "9:24" into a pandas Timedelta
    hours, minutes = col.split(":")[:2]
    hours_timedelta = pd.to_timedelta(int(hours), unit="h")
    minutes_timedelta = pd.to_timedelta(int(minutes), unit="m")

    return hours_timedelta + minutes_timedelta

In [8]:
data.loc[data["Sunrise"]=="-", "Sunrise"] = "Up all day"
data.loc[data["Sunset"]=="-", "Sunset"] = "Up all day"
data.loc[data["Sunrise"]=="Up all day", "Sunset"] = "Up all day"
data.loc[data["Sunset"]=="Up all day", "Sunrise"] = "Up all day"
data.Sunrise = data.Sunrise.apply(lambda x: int_to_timedelta(x) if len(x.split(":"))>1 else x)
data.Sunset = data.Sunset.apply(lambda x: int_to_timedelta(x) if len(x.split(":"))>1 else x)

data.loc[data["Sunrise"]=="Up all day", "Day length"] = pd.to_timedelta(24, unit='h')
data.loc[data["Sunrise"]=="Down all day", "Day length"] = pd.to_timedelta(0, unit='h')

data["Day length"] = data["Day length"].apply(lambda x: pd.to_timedelta(x))

data["Day length diff."] = data["Day length diff."].apply(lambda x: 1 if x[0]=="+" else 0)

data["twilight_start"] = data["twilight_start"].apply(lambda x: 'Rest of night' if x=="-" else x)
data.loc[data["twilight_start"]=="Rest of night", "twilight_end"] = "Rest of night"
data["twilight_start"] = data["twilight_start"].apply(lambda x: int_to_timedelta(x) if len(x.split(":"))>1 else x)
data["twilight_end"] = data["twilight_end"].apply(lambda x: int_to_timedelta(x) if len(x.split(":"))>1 else x)
data.loc[data["twilight_end"]=="Rest of night", "twilight_start"] = "Rest of night"

data["solar noon"] = data["solar noon"].apply(int_to_timedelta)
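As an alternative to the row-by-row `apply` calls above, the same conversion can be sketched in a vectorized way. This is only an illustration on a small made-up Series, not the method used in this notebook: it keeps the timedelta values and the text labels in two separate Series (`as_timedelta` and `label`, both names introduced here) rather than mixing them in one column.

```python
import pandas as pd

# Small made-up sample in the same "H:MM" / label format as the raw columns
raw = pd.Series(["9:24", "15:23", "Up all day", "Down all day"])

# Append seconds so "H:MM" becomes "H:MM:SS"; labels fail to parse and become NaT
as_timedelta = pd.to_timedelta(raw + ":00", errors="coerce")

# Keep the original text only where no time could be parsed
label = raw.where(as_timedelta.isna())

print(as_timedelta)
print(label)
```

Keeping times and labels apart gives the time column a proper `timedelta64` dtype, at the cost of one extra column.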

After cleaning, the dataset looks like this:

In [9]:
data.head()

Unnamed: 0,City,Date,Sunrise,Sunset,Day length,Day length diff.,twilight_start,twilight_end,solar noon
0,helsinki,2015-01-01,0 days 09:24:00,0 days 15:23:00,0 days 05:59:09,1,0 days 08:26:00,0 days 16:20:00,0 days 12:23:00
1,helsinki,2015-01-02,0 days 09:23:00,0 days 15:24:00,0 days 06:01:07,1,0 days 08:26:00,0 days 16:21:00,0 days 12:24:00
2,helsinki,2015-01-03,0 days 09:23:00,0 days 15:26:00,0 days 06:03:15,1,0 days 08:26:00,0 days 16:23:00,0 days 12:24:00
3,helsinki,2015-01-04,0 days 09:22:00,0 days 15:27:00,0 days 06:05:33,1,0 days 08:25:00,0 days 16:24:00,0 days 12:25:00
4,helsinki,2015-01-05,0 days 09:21:00,0 days 15:29:00,0 days 06:08:00,1,0 days 08:25:00,0 days 16:25:00,0 days 12:25:00


## Data analysis
### Sun graph in 2020

In [10]:
df1 = data.copy()
up_start = pd.to_timedelta("00:00:01")
up_end = pd.to_timedelta("23:59:59")

down_start = pd.to_timedelta("11:59:59")
down_end = pd.to_timedelta("12:00:01")

df1.Sunrise = df1.Sunrise.apply(lambda x: up_start if x=="Up all day" else (down_start if x == "Down all day" else x))
df1.Sunset = df1.Sunset.apply(lambda x: up_end if x=="Up all day" else (down_end if x == "Down all day" else x))

In [11]:
a = df1[df1.Date.dt.year == 2020]

In [12]:
fig = px.line(a, x="Date", y = ["Sunrise", "Sunset"], color = "City")

fig.update_layout(title="sunrise, sunset and day length for each city")
fig.update_yaxes(showticklabels=False)
fig.update_xaxes(showgrid=False)
fig.show()

For each city, the lower line is the sunrise time and the upper line is the sunset time. Cities further north, such as Rovaniemi and Ivalo, have a wider range of day length (a larger gap between sunrise and sunset) than cities further south, such as Helsinki and Jyväskylä. The four sharp jumps in the chart mark the dates when the clocks change for daylight saving time.
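The dates of the clock changes can be checked independently of the scraped data. The sketch below assumes the `Europe/Helsinki` time zone applies to all four cities and that time zone data is available to pandas. It looks at local noon on every day of 2020 and reports the days on which the UTC offset differs from the previous day.

```python
import pandas as pd

# Local noon on every day of 2020 (noon is never ambiguous around a clock change)
days = pd.date_range("2020-01-01 12:00", "2020-12-31 12:00", freq="D")
days = days.tz_localize("Europe/Helsinki")

# UTC offset at noon of each day, indexed by calendar date
offsets = pd.Series([ts.utcoffset() for ts in days], index=days.date)

# Days whose offset differs from the previous day (drop the first day, which has no predecessor)
changes = offsets[offsets != offsets.shift()].iloc[1:]
print(changes)
```

Each of the two dates appears as a jump in both the sunrise and the sunset line, which accounts for four jumps per city.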

Because of this wider range, Ivalo and Rovaniemi have days when the sun is up all day or down all day. Ivalo has more of these days than Rovaniemi because it is further from the equator.

### Day length

In [13]:
a = df1.groupby(["City", df1.Date.dt.year])["Day length"].mean(numeric_only=False).reset_index()

In [14]:
fig = px.line(a, x="Date", y = ["Day length"], color = "City")

fig.update_layout(title="duration of day of each city from 2015 to 2020")
fig.update_yaxes(showticklabels=False)
fig.update_xaxes(showgrid=False)
fig.show()

The yearly average day length of each city barely changes, which suggests that the tilt of the Earth's axis did not change much over these years. However, the period is too short to confirm anything. There is a small decrease in day length in 2016 and 2020. Both are leap years with one extra day in February, when days are short, and this pulls the yearly average down.
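The leap-year effect is plain arithmetic: adding one day that is shorter than the average lowers the average. The numbers below are hypothetical and chosen only to illustrate the size of the effect; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical values, for illustration only
mean_365 = pd.Timedelta(hours=12, minutes=20)   # average day length over a 365-day year
feb_29 = pd.Timedelta(hours=10, minutes=30)     # day length on the extra day

# Average after adding the extra, shorter day
mean_366 = (365 * mean_365 + feb_29) / 366

print(mean_366)
print(mean_365 - mean_366)  # how much the average drops
```

With these numbers the average drops by only about 18 seconds, so the dip is real but small.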

### Up all day & Down all day

In [15]:
a = data[data.Sunrise == "Up all day"].groupby(["City", data.Date.dt.year])["Sunrise"].count().reset_index()

In [16]:
a.pivot(index="Date", columns="City", values="Sunrise").iplot(kind='bar', title="Number of days with the sun above the horizon all day")

Ivalo has many more days with the sun above the horizon all day than Rovaniemi, since it is further from the equator. The yearly average is 69 days for Ivalo and 50 days for Rovaniemi.
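Yearly averages like these can be derived by counting the matching days per city and year and then averaging the counts per city. The sketch below shows this on a tiny made-up table (`toy`), so its numbers are not the real ones.

```python
import pandas as pd

# Tiny made-up table: Ivalo has two "Up all day" records per year, Rovaniemi one
toy = pd.DataFrame({
    "City": ["ivalo"] * 4 + ["rovaniemi"] * 2,
    "Date": pd.to_datetime([
        "2019-06-01", "2019-06-02", "2020-06-01", "2020-06-02",
        "2019-06-21", "2020-06-21",
    ]),
    "Sunrise": ["Up all day"] * 6,
})

# Count matching days per city and year
up = toy[toy["Sunrise"] == "Up all day"].assign(Year=lambda d: d["Date"].dt.year)
per_year = up.groupby(["City", "Year"])["Sunrise"].count()

# Average the yearly counts for each city
avg_per_city = per_year.groupby(level="City").mean()
print(avg_per_city)
```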

In [17]:
a = data[data.Sunrise == "Down all day"].groupby(["City", data.Date.dt.year])["Sunrise"].count().reset_index()

In [18]:
fig = px.bar(a, x="Date", y="Sunrise", color="City")

fig.update_layout(title="Number of days with the sun below the horizon all day")
fig.show()

Ivalo is the only city in this list with days when the sun never rises above the horizon. On average, it has 36 such days per year.

In [19]:
a = data[(data.Sunrise == "Up all day")| (data.Sunrise == "Down all day")]
b = a.groupby(["City", "Sunrise", a.Date.dt.month])["Sunset"].count().reset_index()
b.Sunset = b.Sunset/6  # average per year over the six years 2015 to 2020

In [20]:
fig = px.sunburst(b, path=["Date", "Sunrise", "City"], values="Sunset", color="City")

fig.update_layout(title="Sunburst plot of Up all day and Down all day")

fig.show()

### Solar noon

In [21]:
a = data[data.Date.dt.year == 2020]
fig = px.line(a, x="Date", y = "solar noon", color ="City")

fig.update_layout(title="Solar noon time for each city")
fig.update_yaxes(showticklabels=False)
fig.update_xaxes(showgrid=False)
fig.show()

In [22]:
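a = data[data.Date.dt.year == 2020].copy()  # copy to avoid modifying a view of data
# Subtract one hour during daylight saving time so solar noon is on standard time all year
dst = a.Date.between("2020-03-29", "2020-10-24")
a.loc[dst, "solar noon"] = a.loc[dst, "solar noon"] - pd.to_timedelta(1, unit='h')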

In [23]:
fig = px.line(a, x="Date", y = "solar noon", color ="City")

fig.update_layout(title="Solar noon time for each city")
fig.update_yaxes(showticklabels=False)
fig.update_xaxes(showgrid=False)
fig.show()

The time at which the sun reaches its highest point also differs across cities. Ivalo has the earliest solar noon and Helsinki the latest, since their coordinates differ considerably. The solar noon times of Jyväskylä and Rovaniemi are very similar despite the difference in their coordinates.
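One possible explanation, which is not derived from the scraped data, is that solar noon on the clock depends mainly on longitude: the sun moves about 1° of longitude every 4 minutes, so a city west of its time zone's central meridian sees solar noon after 12:00. The sketch below applies this rule of thumb. The longitudes are approximate values supplied here as assumptions, the central meridian of UTC+2 is taken as 30°E, and the seasonal correction (the equation of time) is ignored, so the results are rough estimates only.

```python
import pandas as pd

# Approximate longitudes in degrees east (assumed values, not from the dataset)
approx_longitude = {
    "helsinki": 24.94,
    "jyvaskyla": 25.75,
    "rovaniemi": 25.73,
    "ivalo": 27.54,
}

ZONE_MERIDIAN = 30.0      # central meridian of UTC+2 (15 degrees per hour)
MINUTES_PER_DEGREE = 4    # 24 h * 60 min / 360 degrees


def mean_solar_noon(longitude_deg):
    """Rough standard-time solar noon, ignoring the equation of time."""
    delay = MINUTES_PER_DEGREE * (ZONE_MERIDIAN - longitude_deg)
    return pd.Timedelta(hours=12) + pd.Timedelta(minutes=delay)


estimate = {city: mean_solar_noon(lon) for city, lon in approx_longitude.items()}
for city, t in sorted(estimate.items(), key=lambda kv: kv[1]):
    print(city, t)
```

Under these assumptions Ivalo comes first and Helsinki last, while Jyväskylä and Rovaniemi, which lie at nearly the same longitude, are only seconds apart. This matches the ordering seen in the chart.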