# Data Science Study Resources
Developed by: Yongkang Liu (yongkang.liu.phd@gmail.com)  
Created on October 7, 2019.  
Updated on October 7, 2019.

Data is at the center of modern STEM applications.

This notebook maintains a collection of materials and references to study data science.


<a name="toc"></a>
# Table of contents

1. [Tutorials and references](#tut_ref)

1. [Tools](#tool)
    1. [Python General](#tool.python)
    1. [Beautiful Soup](#tool.beautifulSoup)

1. [End: To add a new section](#end)


<a name="tut_ref"></a>
# Tutorials and References

[A Beginner's Guide to the Data Science Pipeline](https://towardsdatascience.com/a-beginners-guide-to-the-data-science-pipeline-a4904b2d8ad3)

This article presents a general pipeline model, i.e., a workflow, for data scientists. The author summarizes it with the acronym "O.S.E.M.N.", which is short for "Obtaining - Scrubbing/Cleaning - Exploring/Visualizing - Modeling - Interpreting Data".
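As a rough illustration of how the five O.S.E.M.N. stages chain together, here is a toy sketch in plain Python. The function names, the sample records, and the trivial "model" (a mean) are all made up for illustration and do not come from the article.

```python
from statistics import mean

def obtain():
    # Hypothetical raw records; in practice these come from files, APIs, or scraping
    return [{"city": "Toronto", "temp_c": "21.0"},
            {"city": "Toronto", "temp_c": None},
            {"city": "Ottawa", "temp_c": "18.0"}]

def scrub(records):
    # Drop records with missing values and convert strings to numbers
    return [{"city": r["city"], "temp_c": float(r["temp_c"])}
            for r in records if r["temp_c"] is not None]

def explore(records):
    # Summarize: number of clean records per city
    counts = {}
    for r in records:
        counts[r["city"]] = counts.get(r["city"], 0) + 1
    return counts

def model(records):
    # Stand-in "model": the overall mean temperature
    return mean(r["temp_c"] for r in records)

def interpret(prediction):
    # Turn the model output into a human-readable statement
    return f"Expected temperature: {prediction:.1f} C"

clean = scrub(obtain())
summary = explore(clean)
report = interpret(model(clean))
print(summary)
print(report)
```

Each stage consumes the output of the previous one, which is the main point of the pipeline view.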



[How to Learn Data Science for Free](https://towardsdatascience.com/how-to-learn-data-science-for-free-eda10f04d083)

This article provides a rich collection of free online resources for data science self study.

For general Python programming tips, see the notebook below.    
[Python Coding Tips](Python_Coding_Tips.ipynb)

For general pandas programming tips, see the notebook below.  
[Pandas Coding Tips](Pands_Coding_Tips.ipynb)

<a name="tool.beautifulSoup"></a>
### Beautiful Soup
[Back](#toc)

Reference: [Beautiful Soup Document](https://beautiful-soup-4.readthedocs.io/en/latest/)

Beautiful Soup is a Python library for pulling data out of HTML and XML files. As of November 4, 2019, the latest major version is Beautiful Soup 4, which works with Python 2.7 and Python 3.2.
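To make "pulling data out of HTML" concrete, here is a minimal sketch that parses a small inline HTML string. The HTML snippet and variable names are invented for illustration. The parser is Python's built-in `html.parser`, so nothing beyond `bs4` needs to be installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment used only for this example
html = """
<html><body>
  <h1>Postal codes</h1>
  <p class="note">See the <a href="/wiki/Toronto">Toronto</a> article.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

title = soup.h1.get_text()            # text of the first <h1>
link = soup.find("a")                 # first <a> tag in the document
note = soup.find("p", class_="note")  # first <p> with class "note"

print(title)
print(link["href"], link.get_text())
print(note.get_text())
```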





In [1]:
# Import modules
from bs4 import BeautifulSoup
import requests  # HTTP client: Beautiful Soup expects a document, not a URL, so the page is fetched first
import lxml      # parser backend that Beautiful Soup can use

In [13]:
# Obtain the HTML document from the URL
url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
r = requests.get(url)
data = r.text
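The request above assumes that the download succeeds. A slightly more defensive variant, sketched below, sets a timeout and raises an error on HTTP failures. The helper name `fetch_html` and its `get` parameter are hypothetical. The `get` parameter exists only so the function can be exercised without network access.

```python
import requests

def fetch_html(url, timeout=10, get=requests.get):
    """Return the body of `url` as text, raising on HTTP errors.

    `get` is a hypothetical injection point (defaults to requests.get).
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError for 4xx/5xx responses
    return response.text

# Offline demonstration with a stand-in response object
class FakeResponse:
    text = "<html><body>ok</body></html>"

    def raise_for_status(self):
        pass

def fake_get(url, timeout):
    return FakeResponse()

data = fetch_html("https://example.org/page", get=fake_get)
print(data)
```

With the default `get`, a call such as `data = fetch_html(url)` would stand in for the `requests.get(url)` and `r.text` lines.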

In [7]:
# Parse the data; naming the parser avoids bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(data, 'lxml')
# Find the target table
table = soup.find('table', {'class': 'wikitable sortable'})
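Two details of this step are worth illustrating: choosing the parser explicitly, and how the class filter matches. The sketch below uses a made-up HTML string. Passing the full string `'wikitable sortable'` matches the `class` attribute exactly as written, whereas `class_='wikitable'` or a CSS selector matches regardless of the order of the classes.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the two tables list the same classes in a different order
html = """
<table class="wikitable sortable"><tr><td>first</td></tr></table>
<table class="sortable wikitable"><tr><td>second</td></tr></table>
"""

# Naming the parser avoids relying on whichever parser happens to be installed
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all("table", {"class": "wikitable sortable"})  # exact attribute string
by_class = soup.find_all("table", class_="wikitable")            # any table with this class
by_css = soup.select("table.wikitable.sortable")                 # both classes, any order

print(len(exact), len(by_class), len(by_css))
```

The exact-string form is the strictest of the three, so it is the one most likely to stop matching if the page's markup changes.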

In [33]:
# Find all rows in the table
table_rows = table.find_all('tr')
print(f"Got table_rows, type: {type(table_rows)}, size: {len(table_rows)}")

Got table_rows, type: <class 'bs4.element.ResultSet'>, size: 289


In [34]:
# Inspect the header row
print(f'Check the first row: {table_rows[0]}')

Check the first row: <tr>
<th>Postcode</th>
<th>Borough</th>
<th>Neighbourhood
</th></tr>


In [60]:
# Extract the column names from the header row
columns = [th.text.strip('\n') for th in table_rows[0].find_all('th')]
print(f'The headline is {columns}')

The headline is ['Postcode', 'Borough', 'Neighbourhood']


In [61]:
# Extract the table rows, skipping the header row at index 0
for rIndex, row in enumerate(table_rows[1:], start=1):
    row_content = [td.text.strip('\n') for td in row.find_all('td')]
    print(f'Row {rIndex}: {row_content}')

Row 1: ['M1A', 'Not assigned', 'Not assigned']
Row 2: ['M2A', 'Not assigned', 'Not assigned']
Row 3: ['M3A', 'North York', 'Parkwoods']
Row 4: ['M4A', 'North York', 'Victoria Village']
Row 5: ['M5A', 'Downtown Toronto', 'Harbourfront']
Row 6: ['M5A', 'Downtown Toronto', 'Regent Park']
Row 7: ['M6A', 'North York', 'Lawrence Heights']
Row 8: ['M6A', 'North York', 'Lawrence Manor']
Row 9: ['M7A', "Queen's Park", 'Not assigned']
Row 10: ['M8A', 'Not assigned', 'Not assigned']
Row 11: ['M9A', 'Etobicoke', 'Islington Avenue']
Row 12: ['M1B', 'Scarborough', 'Rouge']
Row 13: ['M1B', 'Scarborough', 'Malvern']
Row 14: ['M2B', 'Not assigned', 'Not assigned']
Row 15: ['M3B', 'North York', 'Don Mills North']
Row 16: ['M4B', 'East York', 'Woodbine Gardens']
Row 17: ['M4B', 'East York', 'Parkview Hill']
Row 18: ['M5B', 'Downtown Toronto', 'Ryerson']
Row 19: ['M5B', 'Downtown Toronto', 'Garden District']
Row 20: ['M6B', 'North York', 'Glencairn']
Row 21: ['M7B', 'Not assigned', 'Not assigned']
Row 22:

In [63]:
lstRow = []
# Extract the table rows again, this time collecting them in a list
for rIndex, row in enumerate(table_rows[1:], start=1):
    row_content = [td.text.strip('\n') for td in row.find_all('td')]
    lstRow.append(row_content)
    print(f'Row {rIndex}: {row_content}')

Row 1: ['M1A', 'Not assigned', 'Not assigned']
Row 2: ['M2A', 'Not assigned', 'Not assigned']
Row 3: ['M3A', 'North York', 'Parkwoods']
Row 4: ['M4A', 'North York', 'Victoria Village']
Row 5: ['M5A', 'Downtown Toronto', 'Harbourfront']
Row 6: ['M5A', 'Downtown Toronto', 'Regent Park']
Row 7: ['M6A', 'North York', 'Lawrence Heights']
Row 8: ['M6A', 'North York', 'Lawrence Manor']
Row 9: ['M7A', "Queen's Park", 'Not assigned']
Row 10: ['M8A', 'Not assigned', 'Not assigned']
Row 11: ['M9A', 'Etobicoke', 'Islington Avenue']
Row 12: ['M1B', 'Scarborough', 'Rouge']
Row 13: ['M1B', 'Scarborough', 'Malvern']
Row 14: ['M2B', 'Not assigned', 'Not assigned']
Row 15: ['M3B', 'North York', 'Don Mills North']
Row 16: ['M4B', 'East York', 'Woodbine Gardens']
Row 17: ['M4B', 'East York', 'Parkview Hill']
Row 18: ['M5B', 'Downtown Toronto', 'Ryerson']
Row 19: ['M5B', 'Downtown Toronto', 'Garden District']
Row 20: ['M6B', 'North York', 'Glencairn']
Row 21: ['M7B', 'Not assigned', 'Not assigned']
Row 22:

In [65]:
import pandas as pd
df_tnt = pd.DataFrame(lstRow, columns=columns)
df_tnt.shape

(288, 3)
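The header and row extraction steps above can be combined into one helper that returns a DataFrame directly. This is a sketch under stated assumptions: the function name `table_to_dataframe` is hypothetical, the inline HTML is a made-up stand-in for the Wikipedia table, and the first `<tr>` is assumed to hold the headers.

```python
import pandas as pd
from bs4 import BeautifulSoup

def table_to_dataframe(table):
    """Convert a bs4 <table> tag into a DataFrame (first row = header)."""
    rows = table.find_all("tr")
    columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = [[td.get_text(strip=True) for td in row.find_all("td")]
               for row in rows[1:]]
    return pd.DataFrame(records, columns=columns)

# Made-up miniature of the table structure seen above
html = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood
</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned
</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods
</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
df = table_to_dataframe(soup.find("table"))
print(df.shape)
print(df)
```

Using `get_text(strip=True)` removes the surrounding whitespace, including the trailing newline that the notebook strips with `strip('\n')`.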

In [99]:
df_tnt.Postcode.nunique()  # number of unique postcodes in the table, including the unassigned ones

180

In [66]:
df_tnt.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1A | Not assigned | Not assigned |
| 1 | M2A | Not assigned | Not assigned |
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |


In [74]:
# Keep only the rows with an assigned borough
df_tnt_clean = df_tnt[df_tnt['Borough'] != 'Not assigned']
df_tnt_clean.shape

(211, 3)

In [100]:
df_tnt_clean.Postcode.nunique()

103

In [70]:
df_tnt_clean.head()

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |


In [78]:
# Find rows that have a borough but no neighbourhood
df_empty = df_tnt_clean[df_tnt_clean['Neighbourhood'] == 'Not assigned']
df_empty

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 8 | M7A | Queen's Park | Not assigned |


In [83]:
def replace_name(x, y):
    """Return y if x is 'Not assigned', otherwise x."""
    return y if x == 'Not assigned' else x

# Use the borough name where the neighbourhood is 'Not assigned'.
# assign() returns a new DataFrame instead of setting values on a slice of df_tnt,
# which avoids pandas' SettingWithCopyWarning.
df_tnt_clean = df_tnt_clean.assign(
    Neighbourhood=df_tnt_clean.apply(lambda x: replace_name(x.Neighbourhood, x.Borough), axis=1)
)


In [81]:
df_tnt_clean

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 2 | M3A | North York | Parkwoods |
| 3 | M4A | North York | Victoria Village |
| 4 | M5A | Downtown Toronto | Harbourfront |
| 5 | M5A | Downtown Toronto | Regent Park |
| 6 | M6A | North York | Lawrence Heights |
| 7 | M6A | North York | Lawrence Manor |
| 8 | M7A | Queen's Park | Queen's Park |
| 10 | M9A | Etobicoke | Islington Avenue |
| 11 | M1B | Scarborough | Rouge |
| 12 | M1B | Scarborough | Malvern |
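The fallback from neighbourhood to borough can also be written without a row-wise `apply`. The sketch below is one possible vectorized variant on a made-up two-row sample. The names `df` and `clean` are illustrative and not part of the notebook.

```python
import pandas as pd

# Made-up sample mirroring the structure of the table above
df = pd.DataFrame({
    "Postcode": ["M3A", "M7A"],
    "Borough": ["North York", "Queen's Park"],
    "Neighbourhood": ["Parkwoods", "Not assigned"],
})

# .copy() makes the filtered frame independent of the original,
# so later assignments do not trigger SettingWithCopyWarning
clean = df[df["Borough"] != "Not assigned"].copy()

# mask() replaces values where the condition is True with the aligned Borough value
clean["Neighbourhood"] = clean["Neighbourhood"].mask(
    clean["Neighbourhood"] == "Not assigned", clean["Borough"]
)
print(clean)
```

On a table of a few hundred rows the speed difference is negligible, so the choice is mostly about readability.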


In [95]:
# Combine the neighbourhoods that share a postcode and borough into one comma-separated string
df_tnt_unique = df_tnt_clean.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(','.join).reset_index()
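To see what the group-and-join step does in isolation, here is a small sketch on a made-up three-row DataFrame. The names `df` and `merged` are illustrative only.

```python
import pandas as pd

# Made-up sample: one postcode with two neighbourhoods, one with a single neighbourhood
df = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M3A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighbourhood": ["Harbourfront", "Regent Park", "Parkwoods"],
})

merged = (
    df.groupby(["Postcode", "Borough"])["Neighbourhood"]
      .apply(",".join)    # concatenate the names within each group
      .reset_index()      # turn the group keys back into ordinary columns
)
print(merged)
```

Note that `groupby` sorts the group keys by default, so the result is ordered by postcode rather than by first appearance in the input.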

In [96]:
df_tnt_unique.shape

(103, 3)

In [97]:
df_tnt_unique.head()


|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |


In [98]:
df_tnt_unique.head(10)  # show the first ten rows

|   | Postcode | Borough | Neighbourhood |
|---|---|---|---|
| 0 | M1B | Scarborough | Rouge,Malvern |
| 1 | M1C | Scarborough | Highland Creek,Rouge Hill,Port Union |
| 2 | M1E | Scarborough | Guildwood,Morningside,West Hill |
| 3 | M1G | Scarborough | Woburn |
| 4 | M1H | Scarborough | Cedarbrae |
| 5 | M1J | Scarborough | Scarborough Village |
| 6 | M1K | Scarborough | East Birchmount Park,Ionview,Kennedy Park |
| 7 | M1L | Scarborough | Clairlea,Golden Mile,Oakridge |
| 8 | M1M | Scarborough | Cliffcrest,Cliffside,Scarborough Village West |
| 9 | M1N | Scarborough | Birch Cliff,Cliffside West |


<a name="end"></a>
## End of Notebook
[Back](#toc)