## Content

* What is Python?
* Python basics
    * Data structure
    * Pandas dataframe
    * Function and Class definition
    * Package manager and environments
* File Input/Output
* Database connection
* Web data crawling
* Data visualization
* Model fitting

## 0. What is Python?

### 0.1. In brief

* High-level, general-purpose programming language
* Supports multiple programming paradigms
    * Object-oriented
    * Functional
    * ...
* Interfaces easily with other languages, such as C++ and Java
* A large and comprehensive standard library
* Generally slower than compiled languages such as C++
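
As a rough illustration of the "multiple paradigms" point above, the sketch below sums the same numbers twice: once in an object-oriented style and once in a functional style. The `Accumulator` class and the sample numbers are made up for this example.

```python
from functools import reduce

numbers = [1, 2, 3, 4]  # sample data, chosen only for illustration


# Object-oriented style: state (total) and behaviour (add) live in one class.
class Accumulator:
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value


acc = Accumulator()
for n in numbers:
    acc.add(n)
oo_total = acc.total

# Functional style: no mutable object, just a fold over the sequence.
fp_total = reduce(lambda running, n: running + n, numbers, 0)

print(oo_total, fp_total)
```

Both styles give the same result; which one reads better depends on the problem.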

![title](../pics/history.png)

### 0.2. The ecosystem

![title](../pics/ecosystem.png)

### 0.3. Python 2 vs. Python 3

In [None]:
from IPython.display import IFrame
IFrame('https://pythonclock.org/', width=700, height=200)

### 0.4. IDE
(picture source: https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)

* Jupyter Notebook / JupyterLab (beta)
![title](../pics/jupyternotebook.gif)

* Visual Studio Code
(picture source: https://code.visualstudio.com/docs/python/editing)
![title](../pics/vscode.gif)
* PyCharm
* Spyder
* Atom
* ...

### 0.5. Prerequisite

To run the code below:
* Anaconda (recommended) (https://www.anaconda.com/distribution/)
* The libraries listed in requirements.txt

To run the notebook in presentation mode:
* RISE extension for Jupyter Notebook (https://github.com/damianavila/RISE)

## 1. Python Basics

### 1.1 Data structures

* Variable definition

In [None]:
a = 123
print(a)

* Iterables:

In [None]:
# list
a = [1, 'a', 3]
a

In [None]:
a[1]

In [None]:
a.append(4) ## append values to the list
a

In [None]:
# Set
a = {1,2,3}
b = {2,3,4}
print(a)
print(b)

In [None]:
a.update([5,6,7]) ## add several values to the set
a

In [None]:
a - b

In [None]:
b - a

In [None]:
a&b ## intersection

In [None]:
a|b ## union

In [None]:
# Dictionary

x = {'a':1,'b':[2,3,4],'c':{'d':[1,2,3]}}

In [None]:
x['a']

In [None]:
x['c']['d']

In [None]:
x['e'] = 5 ## add new entry
x
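
A few other dictionary operations tend to come up right away: checking for a key, reading with a default, and looping over entries. The sketch below reuses the shape of the dictionary above; the key `'z'` is just an example of a missing key.

```python
x = {'a': 1, 'b': [2, 3, 4], 'c': {'d': [1, 2, 3]}}

# Membership test on keys
has_a = 'a' in x

# .get returns a default instead of raising KeyError for a missing key
missing = x.get('z', 0)

# .items() yields (key, value) pairs; dictionaries keep insertion order (Python 3.7+)
keys_in_order = [key for key, value in x.items()]

print(has_a, missing, keys_in_order)
```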

* Pandas dataframe

In [None]:
import pandas as pd ## import the pandas library

# show complete dataframe content
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.set_option('display.max_colwidth', None) # None means no limit (-1 is no longer accepted by recent pandas)

In [None]:
df = pd.DataFrame({'name':['ABC','DEF','GHI','JKL'],'age':[20,30,40,50]}) # create from a dictionary
df

In [None]:
names = ['ABC','DEF','GHI','JKL']
ages = [20,30,40,50]
df = pd.DataFrame(zip(names,ages), columns=['name','ages'])
df

In [None]:
df.loc[:,'name'] # slice by column name

In [None]:
df.iloc[:,0] # slice by column index

In [None]:
df.loc[df.name=='ABC'] # slice by condition

In [None]:
df.ages.describe()

In [None]:
df2 = pd.DataFrame({'name':['DEF','GHI','JKL'], 'hometown':['Atlanta, GA', 'Atlanta, GA', 'Knoxville, TN']})
df2

In [None]:
df_combo = pd.concat([df,df2],axis=0,sort=False) # stack two dataframes
df_combo

In [None]:
df_combo = pd.merge( # join dataframes
    df,
    df2,
    on='name',
    how='left'
) ## Other tools are available to do sql like operation on dataframe (https://pypi.org/project/pandasql/)
df_combo

In [None]:
df_combo

In [None]:
df_combo.groupby('hometown').size().reset_index() # simple statistics

In [None]:
df_combo.groupby(['name','hometown']).size().reset_index().rename(columns={0:'frequency'})

In [None]:
df_combo['num_pets'] = [1,2,2,3] # add a numeric column to aggregate
# create a pivot table
df_combo.pivot_table(
    index='name',
    columns='hometown',
    values='num_pets',
    aggfunc='sum'
).fillna(0)

* Function definition

    * Regular function

In [None]:
def helloworld(name):
    print('My name is {}'.format(name))
    # print('My name is %s' % name)

helloworld('Bot')

    * Lambda function

In [None]:
helloworld2 = lambda name: print('My name is {}'.format(name))
helloworld2('Robot')
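
Lambdas are most often used as short, throwaway arguments to other functions rather than being assigned to a name. A small sketch, using made-up `(name, age)` pairs:

```python
people = [('ABC', 20), ('JKL', 50), ('DEF', 30)]  # illustrative (name, age) pairs

# Sort by the second element of each tuple (the age)
by_age = sorted(people, key=lambda person: person[1])

# Pick the entry with the largest age
oldest = max(people, key=lambda person: person[1])

print(by_age)
print(oldest)
```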

* Class definition

In [None]:
class table(object):
    """
    Input table dimensions, calculate table properties    
    Parameters
    ----------
    length: int, table length
    width: int, table width
    height: int, table height
    """
    WHOAMI = 'A table'
    def __init__(self, length, width, height):
        self.length = length
        self.width = width
        self.height = height
    def toparea(self):
        return(self.length * self.width)

In [None]:
tb = table(2,3,4) # positional arguments
tb = table(length=2, width=3, height=4) # equivalent call with keyword arguments
print(tb.WHOAMI)
print(tb.length, tb.width, tb.height)
print(tb.toparea())

* Value assignment

In [None]:
a = 6
a

In [None]:
b = a # b refers to the same (immutable) int; rebinding a later leaves b unchanged
b

In [None]:
a = [1,2,3]
b = a # create a reference
b

In [None]:
a[1] = 4
b

In [None]:
table_a = table(2,3,4)
table_b = table_a # create a reference
table_a.length = 5
table_b.length
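
When an independent copy is needed rather than a second reference, the standard library `copy` module can help. The sketch below contrasts a plain reference, a shallow copy, and a deep copy on a small nested list (the data is illustrative only).

```python
import copy

a = [1, [2, 3]]

alias = a                  # same object as a
shallow = copy.copy(a)     # new outer list, but the inner list is still shared
deep = copy.deepcopy(a)    # fully independent copy

a[0] = 99                  # replaces an element of the outer list
a[1].append(4)             # mutates the shared inner list

print(alias)    # follows every change made through a
print(shallow)  # keeps the old first element, but sees the inner append
print(deep)     # unaffected by both changes
```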

### 1.2. Control statement

* Loop

In [None]:
for i in range(2): # i takes the values 0 and 1
    print(i)

In [None]:
i = 0
while i<3:
    i+=1
print(i)

In [None]:
## show a progress bar while looping (tqdm)
from tqdm import trange
import time

a = 0
for i in trange(100):
    time.sleep(0.1)
    a = a + 1
print(a)

* Condition structure

In [None]:
values = [1,2,3,4]
for value in values:
    if value%2==0:
        print(value)

In [None]:
x = 5
output = 1 if x<3 else 0 ## conditional value assignment
output

In [None]:
## combination of loop and condition
output = [
    i
    for i in trange(10) if i%2==0
] # list comprehension
output
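
For comparison, here is a sketch of the explicit loop that a list comprehension replaces, plus the analogous dictionary comprehension. Plain `range` is used instead of `trange` so the example has no extra dependency.

```python
# Explicit loop with a condition
evens_loop = []
for i in range(10):
    if i % 2 == 0:
        evens_loop.append(i)

# Equivalent list comprehension
evens_comp = [i for i in range(10) if i % 2 == 0]

# The same syntax builds dictionaries: number -> square
squares = {i: i * i for i in range(4)}

print(evens_loop == evens_comp, squares)
```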

### 1.3. Package manager

* pip (example)
* conda

### 1.4. Virtual Environment
* virtualenv (example)
* conda
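
pip, conda and virtualenv are command-line tools, so they are normally run from a shell. As a minimal sketch that stays inside Python, the standard library `venv` module can create a comparable isolated environment. The directory name `demo-env` is an arbitrary choice for this example, and the environment is created in a temporary folder that is deleted afterwards.

```python
import os
import tempfile
import venv

# Typical shell commands (not run here):
#   pip install <package>        conda install <package>
#   virtualenv <directory>       conda create --name <env-name>

with tempfile.TemporaryDirectory() as tmp:
    env_dir = os.path.join(tmp, "demo-env")  # hypothetical environment name
    # with_pip=False skips installing pip, which keeps the sketch fast and offline
    venv.create(env_dir, with_pip=False)
    # every environment created by venv contains a pyvenv.cfg file
    has_config = os.path.isfile(os.path.join(env_dir, "pyvenv.cfg"))

print(has_config)
```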

## 2. File Input/Output

* Read a file
    * President Trump's inauguration speech (2017)

In [None]:
with open('../data/trump_inauguration.txt','r') as f:
    for index,line in enumerate(f.readlines()):
        if index<4:
            print(line)

* Write to a file
    * Example: Count word frequency and write result to a text file

In [None]:
## count the word frequency
dict_freq = {}

with open('../data/trump_inauguration.txt','r') as f:
    for line in f.readlines():
        words = line.lower().split()
        for word in words:
            word2 = word.strip('"').strip("'").replace(',','').replace('.','').replace(';','').replace('–','')
            if len(word2)>0:
                # add 1 to the current count, starting from 0 for unseen words
                dict_freq[word2] = dict_freq.get(word2,0)+1
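
An alternative to the manual dictionary bookkeeping is `collections.Counter` from the standard library. The sketch below runs on a made-up sentence instead of the speech file, and strips all ASCII punctuation via `string.punctuation`, which is slightly broader than the character list used above.

```python
from collections import Counter
import string

# Made-up sample text standing in for the contents of the input file
sample = "The cat saw the dog, and the dog saw the cat."

# Lower-case, split on whitespace, and strip punctuation from both ends
words = (word.strip(string.punctuation) for word in sample.lower().split())

# Counter tallies the non-empty words in a single pass
freq = Counter(word for word in words if word)

print(freq.most_common(1))
```

`Counter` is a `dict` subclass, so it can be written out with the same `.items()` loop.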

In [None]:
## write to a CSV file
with open('../data/word_count_out.csv','w') as f:
    for key,value in dict_freq.items():
        f.write('{word},{freq}\n'.format(word=key,freq=value))

In [None]:
## check the file content
import subprocess
result = subprocess.check_output('head -5 ../data/word_count_out.csv',shell=True).decode('utf-8')
print(result)
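
The `head` command is only available on Unix-like systems. A portable sketch of the same check in pure Python is shown below; it writes a small made-up file to a temporary folder so that it runs on its own, and the rows are placeholders rather than real counts.

```python
import os
import tempfile
from itertools import islice

# Placeholder rows standing in for the real word_count_out.csv
rows = ["we,3\n", "will,3\n", "the,2\n", "job,1\n", "done,1\n", "today,1\n"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "word_count_out.csv")
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(rows)

    # islice stops after five lines without reading the rest of the file
    with open(path, "r", encoding="utf-8") as f:
        first_five = [line.rstrip("\n") for line in islice(f, 5)]

print(first_five)
```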

* File analysis with Pandas dataframe
    * Example: Find the most frequent words

In [None]:
df = pd.read_csv('../data/word_count_out.csv',sep=',',header=None,names=['word','freq'])
df.head(3)

In [None]:
# the stop_words module was removed from recent scikit-learn releases; import the list directly
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
df2 = df.loc[~df.word.isin(
    ENGLISH_STOP_WORDS
)]

In [None]:
df2.sort_values(by='freq',ascending=False).head(10)

## 3. Database Connection

* Common tools
    * pyodbc
    * sqlalchemy
    * ...
* MySQL example
    * library: mysqlclient (https://pypi.org/project/mysqlclient/)

    ![title](../pics/mysql_query.png)

In [None]:
import MySQLdb
import pandas as pd

# create the database connection
db = MySQLdb.connect(host="localhost",
                     user="test123",
                     passwd="1234")
# query database, get output as a dataframe
df = pd.read_sql('select * from adhoc.word_count',db)

In [None]:
## another way
# create a cursor to execute query
cur = db.cursor()
cur.execute("select * from adhoc.word_count")
result = list(cur.fetchall())
df = pd.DataFrame(result,columns=['word','freq'])
df.sort_values(by='freq',ascending=False).head(10)
# remove the stop words
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
df2 = df.loc[~df.word.isin(
    ENGLISH_STOP_WORDS
)]
df2.sort_values(by='freq',ascending=False).head(10)
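
The MySQL cells need a running server. To try the same connect / execute / fetch pattern without one, the standard library `sqlite3` module can be used with an in-memory database. This is only a sketch: the table name mirrors the one above, and the rows are made up.

```python
import sqlite3

# An in-memory database that disappears when the connection is closed
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_count (word TEXT, freq INTEGER)")

# Made-up rows; "?" placeholders let the driver handle quoting
conn.executemany(
    "INSERT INTO word_count VALUES (?, ?)",
    [("america", 5), ("the", 9), ("people", 4)],
)

# Same cursor pattern as with MySQLdb
cur = conn.cursor()
cur.execute(
    "SELECT word, freq FROM word_count WHERE freq >= ? ORDER BY freq DESC",
    (5,),
)
top_rows = cur.fetchall()
conn.close()

print(top_rows)
```

`pd.read_sql` also accepts a `sqlite3` connection, so the dataframe route works the same way.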

## 4. Webpage crawling

* Common tools
    * requests, beautifulsoup, etc.
    * regex match (re)
* Example: collect all car rental locations listed on www.advantage.com


In [None]:
from IPython.display import IFrame
IFrame("https://www.advantage.com/us-location/", width=1400, height=700)

![title](../pics/html_locations.png)

In [None]:
import urllib.request # a bare "import urllib" does not reliably load the request submodule
import re

base = 'https://www.advantage.com/us-location/'
res = urllib.request.urlopen(base)
html = res.read().decode('utf-8')

In [None]:
## step 1: get all car rental locations
airports = re.findall(
    r'fa fa\-map\-marker" aria\-hidden="true"></i>\s*?<p>(.*?)</p>', # raw string keeps the regex escapes intact
    html
)
airports[:5]

![title](../pics/html_locations.png)

In [None]:
## step 2: get all page links for car rental locations
paths = re.findall(
    r'href="?(.*)"\s*class="aez-icon-location"', # raw string keeps the regex escapes intact
    html
)
paths[:5]

![title](../pics/html_address.png)

In [None]:
## step 3: get the address
paths_full = [
    base + path.split('/')[-1]
    for path in paths
]
paths_full[:3]
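
String concatenation works here because only the last path segment is kept. A more general sketch uses `urllib.parse.urljoin`, which resolves both absolute and relative links against the base URL. The two `href` forms below are assumptions about what the page might contain.

```python
from urllib.parse import urljoin

base = 'https://www.advantage.com/us-location/'

# Two hypothetical href values for the same location page
absolute_path = '/us-location/phoenix-sky-harbor-airport-phx'
relative_path = 'phoenix-sky-harbor-airport-phx'

# urljoin handles both forms and yields the same full URL
full_from_absolute = urljoin(base, absolute_path)
full_from_relative = urljoin(base, relative_path)

print(full_from_absolute)
print(full_from_relative)
```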

In [None]:
path2search = 'https://www.advantage.com/us-location/phoenix-sky-harbor-airport-phx'
html = urllib.request.urlopen(path2search).read().decode('utf-8')
address = re.findall(r'Address:</h4>\s*<p class="aez\-info\-text">(.*?)<br>(.*?)<br>',html)
address
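
Regular expressions are sensitive to small changes in the markup, for example the order of attributes. As a sketch of a more tolerant approach, the standard library `html.parser` can collect the same links. The HTML snippet is invented for illustration (only the class name `aez-icon-location` comes from the pattern above), and `LinkCollector` is a hypothetical helper.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if self.css_class in classes:
            self.links.append(attributes.get("href"))


# Invented markup; note that the attribute order differs between the links
sample_html = '''
<a href="/us-location/airport-one" class="aez-icon-location">One</a>
<a href="/about" class="nav">About</a>
<a class="aez-icon-location" href="/us-location/airport-two">Two</a>
'''

collector = LinkCollector("aez-icon-location")
collector.feed(sample_html)
links = collector.links

print(links)
```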

## 5. Data Visualization

* Common tools:
    * Matplotlib
    * Seaborn
    * ggplot
    * plotly
    * ...
    ![title](../pics/pyplot.png)

* 5.1. X-Y plot

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(5,3))
x = np.arange(0,10,0.1)
y = np.sin(x)
plt.plot(x,y,linestyle='--',color='blue')
plt.show()

In [None]:
plt.figure(figsize=(10,3))
y2 = np.exp(-x/2)
plt.subplot(1,2,1)
plt.plot(x,y,linestyle='--',color='blue')
plt.subplot(1,2,2)
plt.plot(x,y2,linestyle='-',color='red')
plt.show()

* 5.2. Bar chart

In [None]:
x = ['China','India','USA','Russia','Japan']
y = [1394710000,
     1344100000,
     328779000,
     146793744,
     126330000
    ] # data from wikipedia
plt.bar(
    x = x,
    height=y,
    width=0.5
)
# plt.yticks(
#     np.arange(0,1.4e9,0.2e9),
#     ['{:,.0f}'.format(value) for value in np.arange(0,1.4e9,0.2e9)]
# )
plt.show()

* 5.3. Histogram

In [None]:
y = np.random.randn(1000)
plt.hist(y,bins=100,cumulative=False)
plt.show()

In [None]:
x = np.arange(-3,3,0.01)
y2 = (1/np.sqrt(2*np.pi))*np.exp(-x**2/2)
plt.hist(y,bins=100,cumulative=False,density=True,label='hist')
plt.plot(x,y2,label='pdf')
plt.legend()
plt.show()
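
For reference, the curve overlaid on the histogram is the standard normal density, the distribution that `np.random.randn` samples from. With `density=True` the bar areas sum to 1, so the histogram and the curve share the same scale:

$$
f(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right), \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1
$$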

* 5.4. Heatmap
    * Example: Synthetic (randomly generated) movie rating data

In [None]:
import random
movies = ['Avatar','Pirates of the Caribbean','Star Wars','Spider-Man','The Avengers']
users = [''.join(random.sample('abcdefghijklmnopqrstuvwxyz',5)) for i in range(5)]
ratings = np.array([
    random.sample(np.arange(1,5,0.5).tolist(),5)
    for i in range(5)
])
# ratings[:2,]

In [None]:
plt.imshow(ratings)

for i in range(len(movies)): # rows (y-axis) are labelled with movies
    for j in range(len(users)): # columns (x-axis) are labelled with users
        plt.text(j, i, ratings[i, j],
                       ha="center", va="center", color="w")

plt.xticks(range(5),users)
plt.yticks(range(5),movies)
plt.colorbar()
plt.show()

* 5.5. Visualization on geolocation map
    * Common tools
        * folium (https://github.com/python-visualization/folium)
        * gmplot (https://github.com/vgm64/gmplot)

In [None]:
import folium
georgia = [
    (34.992756,-85.625226),
    (30.721736,-84.926279),
    (30.589405,-81.492173),
    (32.016373,-80.817001),
    (34.992871,-83.101132),
    (34.992756,-85.625226)
]
m = folium.Map([30.909508, -84.355094], zoom_start=6, height='60%')
folium.PolyLine(georgia).add_to(m)
m

In [None]:
locations = [
    (38.347, -77.488),
    (33.62, -84.499),
    (40.885999999999996, -81.566),
    (33.865, -117.84100000000001),
    (34.202, -118.402)
]
names = ['abc','def','ghi','jkl','mno']

m = folium.Map([34.909508, -89.355094], zoom_start=5, height='50%')

# mark all locations
for index,location in enumerate(locations):
    folium.Marker(
        [location[0], location[1]],
        popup=folium.Popup(names[index],parse_html=True),
        icon=folium.Icon(color='red')
    ).add_to(m)
m

## 6. Model Fitting Examples

* 6.1. Linear Regression (diabetes data)
    ![title](../pics/diabetes_data.png)

In [None]:
from sklearn import datasets

In [None]:
diabetes = datasets.load_diabetes()

In [None]:
diabetes['data'].shape

In [None]:
diabetes['target'].shape

In [None]:
diabetes['feature_names']

In [None]:
import pandas as pd

df = pd.DataFrame(diabetes['data'],columns=diabetes['feature_names'])
df['target'] = diabetes['target']
df.head(3)

In [None]:
df.corr()

In [None]:
df_x = df.loc[:,['bmi']]
df_y = df.loc[:,'target']

In [None]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(df_x,df_y,test_size=0.3,random_state=123)

In [None]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=None) # the normalize argument was removed from recent scikit-learn releases

In [None]:
lr.fit(x_train,y_train)
lr.coef_, lr.intercept_

In [None]:
lr.score(x_train,y_train) # R2 value

In [None]:
import numpy as np
y_test_hat = lr.predict(x_test)
MSE = np.mean((y_test_hat-y_test)**2)
R2 = lr.score(x_test,y_test)
print('''
MSE:{}
R2:{}
'''.format(MSE,R2))
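
To make the two metrics concrete, the sketch below computes MSE and R² by hand from their definitions: R² = 1 - SS_res / SS_tot, which is also what `LinearRegression.score` reports. The numbers are made up and are not from the diabetes data.

```python
# Made-up true values and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

n = len(y_true)
mean_true = sum(y_true) / n

# Residual sum of squares: squared errors of the predictions
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

# Total sum of squares: squared deviations from the mean of the true values
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

mse = ss_res / n
r2 = 1 - ss_res / ss_tot

print(mse, r2)
```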

In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
plt.plot(x_train['bmi'],y_train,'ro',label='raw-train')
plt.plot(x_train['bmi'],lr.predict(x_train),label='fit-train')
plt.legend()
plt.subplot(1,2,2)
plt.plot(x_test['bmi'],y_test,'go',label='raw-test')
plt.plot(x_test['bmi'],y_test_hat,label='fit-test')
plt.legend()
plt.show()

* 6.2. k-Means (Iris dataset)
    ![title](../pics/iris_data.png)

In [None]:
from sklearn import datasets
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
iris = datasets.load_iris()

In [None]:
iris['feature_names']

In [None]:
iris['target_names']

In [None]:
iris['data'][:3,]

In [None]:
iris['target']

In [None]:
from sklearn.cluster import KMeans

withincluster_ssd = []
## use elbow plot to determine the ideal number of clusters
for i in range(1,11):
    km = KMeans(n_clusters=i)
    km.fit(iris['data'])
    withincluster_ssd.append(km.inertia_)
## generate the elbow plot
plt.plot(list(range(1,11)),withincluster_ssd)
plt.vlines(x=3,ymin=0,ymax=600,linestyles='dashed')
plt.show()

In [None]:
kmeans = KMeans(n_clusters = 3)
y_kmeans = kmeans.fit_predict(iris['data']) # cluster ids 0-2; the numbering is arbitrary

In [None]:
# Visualise the cluster distribution
# k-means cluster ids do not correspond to species names, so label the points by cluster id
index_x = 0
index_y = 1
for cluster_id, style in zip(range(3), ['ro','go','bo']):
    mask = y_kmeans==cluster_id
    plt.plot(iris['data'][mask,index_x],iris['data'][mask,index_y],style,label='cluster {}'.format(cluster_id))
plt.legend()
plt.show()
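
Because k-means numbers its clusters arbitrarily, cluster 0 is not necessarily setosa. One simple way to attach names, when true labels are available, is a majority vote inside each cluster. The sketch below uses short made-up lists in place of `y_kmeans` and the species names looked up from `iris['target']`.

```python
from collections import Counter, defaultdict

# Made-up stand-ins for the cluster ids and the true species of nine samples
cluster_ids = [1, 1, 1, 0, 0, 2, 2, 2, 0]
true_names = [
    'setosa', 'setosa', 'setosa',
    'versicolor', 'versicolor',
    'virginica', 'virginica', 'virginica', 'virginica',
]

# Group the true names by the cluster each sample was assigned to
members = defaultdict(list)
for cluster_id, name in zip(cluster_ids, true_names):
    members[cluster_id].append(name)

# Name each cluster after its most common true label
cluster_to_name = {
    cluster_id: Counter(names).most_common(1)[0][0]
    for cluster_id, names in members.items()
}

print(cluster_to_name)
```

Samples whose true label differs from their cluster's name are the ones the clustering got "wrong" under this mapping.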