## Content

* What is Python??
* Python basics
    * Data structure
    * Pandas dataframe
    * Function and Class definition
    * Package manager and environments
* File Input/Output
* Database connection
* Web data crawling
* Data visualization
* Model fitting

## 0. What is Python??

### 0.1. In brief

* High-level programming language for general-purpose programming
* Supports multiple programming paradigms
    * Object-oriented
    * Functional
    * ...
* Easy interface with other languages, such as C++/Java
* A large and comprehensive standard library
* Typically slower at runtime than compiled languages such as C++, though
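
As a small illustration of the paradigms listed above, the sketch below sums the same numbers once in an object-oriented style and once in a functional style. The `Accumulator` class and the variable names are made up for this example.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# Object-oriented style: state and behaviour live together in a class.
class Accumulator:  # hypothetical helper class, for illustration only
    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value

acc = Accumulator()
for n in numbers:
    acc.add(n)
oo_total = acc.total

# Functional style: combine values with functions, no mutable state.
fn_total = reduce(lambda x, y: x + y, numbers)
squares = list(map(lambda n: n * n, numbers))

print(oo_total, fn_total, squares)
```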

![title](../pics/history.png)

### 0.2. The eco-system

![title](../pics/ecosystem.png)

### 0.3. Python 2 vs. Python 3

In [84]:
from IPython.display import IFrame
IFrame('https://pythonclock.org/', width=700, height=200)

### 0.4. IDE

* Jupyter Notebook / JupyterLab (beta)
(picture source: https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook)
![title](../pics/jupyternotebook.gif)

* Visual Studio Code
(picture source: https://code.visualstudio.com/docs/python/editing)
![title](../pics/vscode.gif)
* PyCharm
* Spyder
* Atom
* ...

### 0.5. Prerequisite

To run the code below:
* Anaconda (recommended) (https://www.anaconda.com/distribution/)
* The libraries listed in requirements.txt

To run the notebook in presentation mode:
* RISE extension for Jupyter Notebook (https://github.com/damianavila/RISE)

## 1. Python Basics

### 1.1. Data structures

* Variable definition

In [2]:
a = 123
print(a)

123


* Iterables:

In [20]:
# list
a = [1, 'a', 3]
a

[1, 'a', 3]

In [21]:
a[1]

'a'

In [23]:
a.append(4) ## append a value to the list (running this cell twice appends twice)
a

[1, 'a', 3, 4]

In [24]:
# Set
a = {1,2,3}
b = {2,3,4}
print(a)
print(b)

{1, 2, 3}
{2, 3, 4}


In [25]:
a.update([5,6,7]) ## append values
a

{1, 2, 3, 5, 6, 7}

In [26]:
a - b

{1, 5, 6, 7}

In [27]:
b - a

{4}

In [28]:
a&b ## intersection

{2, 3}

In [29]:
a|b ## union

{1, 2, 3, 4, 5, 6, 7}

In [30]:
# Dictionary

x = {'a':1,'b':[2,3,4],'c':{'d':[1,2,3]}}

In [31]:
x['a']

1

In [32]:
x['c']['d']

[1, 2, 3]

In [33]:
x['e'] = 5 ## add new entry
x

{'a': 1, 'b': [2, 3, 4], 'c': {'d': [1, 2, 3]}, 'e': 5}

* Pandas dataframe

In [66]:
import pandas as pd ## import the pandas library

# show complete dataframe content
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.set_option('display.max_colwidth', None) # None means no truncation; -1 is rejected by recent pandas

In [14]:
df = pd.DataFrame({'name':['ABC','DEF','GHI','JKL'],'age':[20,30,40,50]}) # create from a dictionary
df

Unnamed: 0,name,age
0,ABC,20
1,DEF,30
2,GHI,40
3,JKL,50


In [16]:
names = ['ABC','DEF','GHI','JKL']
ages = [20,30,40,50]
df = pd.DataFrame(list(zip(names,ages)), columns=['name','ages']) # create from paired lists
df

Unnamed: 0,name,ages
0,ABC,20
1,DEF,30
2,GHI,40
3,JKL,50


In [17]:
df.loc[:,'name'] # slice by column name

0    ABC
1    DEF
2    GHI
3    JKL
Name: name, dtype: object

In [18]:
df.iloc[:,0] # slice by column index

0    ABC
1    DEF
2    GHI
3    JKL
Name: name, dtype: object

In [19]:
df.loc[df.name=='ABC'] # slice by condition

Unnamed: 0,name,ages
0,ABC,20


In [20]:
df.ages.describe()

count     4.000000
mean     35.000000
std      12.909944
min      20.000000
25%      27.500000
50%      35.000000
75%      42.500000
max      50.000000
Name: ages, dtype: float64

In [21]:
df2 = pd.DataFrame({'name':['DEF','GHI','JKL'], 'hometown':['Atlanta, GA', 'Atlanta, GA', 'Knoxville, TN']})
df2

Unnamed: 0,name,hometown
0,DEF,"Atlanta, GA"
1,GHI,"Atlanta, GA"
2,JKL,"Knoxville, TN"


In [22]:
df_combo = pd.concat([df,df2],axis=0,sort=False) # stack two dataframes
df_combo

Unnamed: 0,name,ages,hometown
0,ABC,20.0,
1,DEF,30.0,
2,GHI,40.0,
3,JKL,50.0,
0,DEF,,"Atlanta, GA"
1,GHI,,"Atlanta, GA"
2,JKL,,"Knoxville, TN"


In [23]:
df_combo = pd.merge( # join dataframes
    df,
    df2,
    on='name',
    how='left'
) ## Other tools are available to do sql like operation on dataframe (https://pypi.org/project/pandasql/)
df_combo

Unnamed: 0,name,ages,hometown
0,ABC,20,
1,DEF,30,"Atlanta, GA"
2,GHI,40,"Atlanta, GA"
3,JKL,50,"Knoxville, TN"


In [25]:
df_combo.groupby('hometown').size().reset_index() # simple statistics

Unnamed: 0,hometown,0
0,"Atlanta, GA",2
1,"Knoxville, TN",1


In [26]:
df_combo.groupby(['name','hometown']).size().reset_index().rename(columns={0:'frequency'})

Unnamed: 0,name,hometown,frequency
0,DEF,"Atlanta, GA",1
1,GHI,"Atlanta, GA",1
2,JKL,"Knoxville, TN",1


In [27]:
df_combo['num_pets'] = [1,2,2,3] # add a column to aggregate
df_combo.pivot_table( # create a pivot table
    index='name',
    columns='hometown',
    values='num_pets',
    aggfunc='sum'
).fillna(0)

hometown,"Atlanta, GA","Knoxville, TN"
name,Unnamed: 1_level_1,Unnamed: 2_level_1
DEF,2.0,0.0
GHI,2.0,0.0
JKL,0.0,3.0


* Function definition

        Regular function

In [49]:
def helloworld(name):
    print('My name is {}'.format(name))
    # print('My name is %s' % name)

helloworld('Bot')

My name is Bot


        Lambda function

In [50]:
helloworld2 = lambda name: print('My name is {}'.format(name))
helloworld2('Robot')

My name is Robot


* Class definition

In [53]:
class table(object):
    """
    Input table dimensions, calculate table properties    
    Parameters
    ----------
    length: int, table length
    width: int, table width
    height: int, table height
    """
    WHOAMI = 'A table'
    def __init__(self, length, width, height):
        self.length = length
        self.width = width
        self.height = height
    def toparea(self):
        return(self.length * self.width)

In [55]:
tb = table(2,3,4)
tb = table(length=2, width=3, height=4)
print(tb.WHOAMI)
print(tb.length, tb.width, tb.height)
print(tb.toparea())

A table
2 3 4
6


* Value assignment

In [57]:
a = 6
a

6

In [64]:
b = a # ints are immutable, so b keeps its value even if a is reassigned later
b

6

In [59]:
a = [1,2,3]
b = a # create a reference
b

[1, 2, 3]

In [61]:
a[1] = 4
b

[1, 4, 3]

In [65]:
table_a = table(2,3,4)
table_b = table_a # create a reference
table_a.length = 5
table_b.length

5
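
The cells above show that assigning a list or an object only creates a second name for the same thing. When an independent copy is wanted, one common approach is the standard `copy` module. The following is a minimal sketch with made-up variable names:

```python
import copy

a = [1, [2, 3]]
alias = a                 # second name for the same list
shallow = a.copy()        # new outer list, but the inner list is still shared
deep = copy.deepcopy(a)   # fully independent copy

a[0] = 99                 # changes the outer list
a[1].append(4)            # changes the shared inner list

print(alias)    # [99, [2, 3, 4]]
print(shallow)  # [1, [2, 3, 4]]
print(deep)     # [1, [2, 3]]
```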

### 1.2. Control statement

* Loop

In [2]:
for i in range(2): # i takes the values 0 and 1
    print(i)

0
1


In [3]:
i = 0
while i<3:
    i+=1
print(i)

3


In [4]:
## a better visualization!!
from tqdm import trange
import time

a = 0
for i in trange(100):
    time.sleep(0.1)
    a = a + 1
print(a)

100%|██████████| 100/100 [00:10<00:00,  9.57it/s]

100





* Condition structure

In [5]:
values = [1,2,3,4]
for value in values:
    if value%2==0:
        print(value)

2
4


In [7]:
x = 5
output = 1 if x<3 else 0 ## conditional value assignment
output

0

In [8]:
## combination of loop and condition
output = [
    i
    for i in trange(10) if i%2==0
] # list comprehension
output

100%|██████████| 10/10 [00:00<00:00, 47554.47it/s]


[0, 2, 4, 6, 8]

### 1.3. Package manager

* pip (example)
* conda

### 1.4. Virtual Environment
* virtualenv (example)
* conda
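
The outline above names virtualenv and conda. As a rough stand-in that needs nothing beyond the standard library, the sketch below creates an environment with Python's built-in `venv` module instead. The directory name `demo-env` is a placeholder, and pip is left out only to keep the sketch fast and offline.

```python
import tempfile
import venv
from pathlib import Path

# Placeholder location; a real project would typically keep the environment next to the code.
env_dir = Path(tempfile.mkdtemp()) / "demo-env"

# with_pip=False skips bootstrapping pip (an assumption made for speed in this sketch).
venv.create(env_dir, with_pip=False)

# A created environment contains a pyvenv.cfg file describing it.
print((env_dir / "pyvenv.cfg").exists())
```

Packages installed after activating such an environment stay isolated from the system-wide Python installation.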

## 2. File Input/Output

* Read a file
    * President Trump's inauguration speech (2017)

In [37]:
with open('../data/trump_inauguration.txt','r') as f:
    for index,line in enumerate(f.readlines()):
        if index<4:
            print(line)

Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you.



We, the citizens of America, are now joined in a great national effort to rebuild our country and to restore its promise for all of our people.





* Write to a file
    * Example: Count word frequency and write result to a text file

In [78]:
## count the word frequency
dict_freq = {}

with open('../data/trump_inauguration.txt','r') as f:
    for line in f: # iterate line by line instead of loading the whole file
        words = line.lower().split()
        for word in words:
            # strip quotes and common punctuation from each token
            word2 = word.strip('"').strip("'").replace(',','').replace('.','').replace(';','').replace('–','')
            if len(word2)>0:
                # start unseen words at 0, then increment
                dict_freq[word2] = dict_freq.get(word2, 0) + 1
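
The same counting idea can also be sketched with `collections.Counter` from the standard library. To stay self-contained, this version counts words in a single sentence quoted from the speech output above rather than reading the file, and it uses a simple regular expression for tokenizing, which is an assumption and not identical to the punctuation stripping in the cell above.

```python
import re
from collections import Counter

# One sentence from the speech, used here in place of the full file.
sample_text = (
    "We, the citizens of America, are now joined in a great national effort "
    "to rebuild our country and to restore its promise for all of our people."
)

# Lowercase the text and keep runs of letters and apostrophes as words.
words = re.findall(r"[a-z']+", sample_text.lower())

# Counter maps each word to the number of times it occurs.
word_counts = Counter(words)
print(word_counts["of"], word_counts["america"])
print(word_counts.most_common(3))
```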

In [79]:
## write to a CSV file
with open('../data/word_count_out.csv','w') as f:
    for key,value in dict_freq.items():
        f.write('{word},{freq}\n'.format(word=key,freq=value))

In [80]:
## check the file content
import subprocess
result = subprocess.check_output('head -5 ../data/word_count_out.csv',shell=True).decode('utf-8')
print(result)

chief,1
justice,1
roberts,1
president,5
carter,1



* File analysis with Pandas dataframe
    * Example: Find the most frequent words

In [81]:
df = pd.read_csv('../data/word_count_out.csv',sep=',',header=None,names=['word','freq'])
df.head(3)

Unnamed: 0,word,freq
0,chief,1
1,justice,1
2,roberts,1


In [82]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS # recent scikit-learn no longer provides the stop_words module
df2 = df.loc[~df.word.isin(
    ENGLISH_STOP_WORDS
)]

In [83]:
df2.sort_values(by='freq',ascending=False).head(5)

Unnamed: 0,word,freq
19,america,17
101,american,12
31,country,9
11,people,9
25,great,6


## 3. Database Connection

* Common tools
* MySQL example
* AWS example
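
The MySQL and AWS examples are not included in this outline. As a hedged stand-in, the sketch below uses the standard library's `sqlite3` module with an in-memory database to show the usual connect / execute / fetch pattern. The table name `people` and its rows are illustrative, reusing the names and ages from the dataframe examples. MySQL drivers generally follow the same pattern but need a running server and credentials, and they typically use `%s` rather than `?` as the parameter placeholder.

```python
import sqlite3

# In-memory SQLite database, used here in place of a MySQL or cloud-hosted server.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create an illustrative table and insert a few rows.
cur.execute("CREATE TABLE people (name TEXT, age INTEGER)")
cur.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("ABC", 20), ("DEF", 30), ("GHI", 40), ("JKL", 50)],
)
conn.commit()

# Parameterized query: the value is passed separately from the SQL text.
cur.execute("SELECT name, age FROM people WHERE age > ? ORDER BY age", (25,))
rows = cur.fetchall()
print(rows)

conn.close()
```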

## 4. Webpage crawling

* Common tools
* Regex match
* Example (www.advantage.com, grab all car rental addresses)


In [None]:
IFrame("https://www.advantage.com/us-location/", width=1400, height=400)
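
To illustrate the "Regex match" step without fetching the live site, the sketch below runs a regular expression over a made-up HTML fragment. The markup, the `address` class name, and the two addresses are all hypothetical; the real page's structure is not shown in this notebook and will differ.

```python
import re

# Made-up HTML fragment standing in for a downloaded page.
html = """
<div class="location"><span class="address">123 Example St, Atlanta, GA 30301</span></div>
<div class="location"><span class="address">456 Sample Ave, Knoxville, TN 37901</span></div>
"""

# Non-greedy group captures the text inside each hypothetical address span.
pattern = re.compile(r'<span class="address">(.*?)</span>')
addresses = pattern.findall(html)

for address in addresses:
    print(address)
```

For anything beyond simple, regular markup, an HTML parser is usually more robust than regular expressions.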

## 5. Data Visualization

* 5.1. X-Y plot

* 5.2. Bar chart

* 5.3. Histogram

* 5.4. Heatmap

* 5.5. Visualization on geolocation map
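
The plotting cells are not included in this outline. As a rough sketch of the first four chart types listed above, the example below uses matplotlib with synthetic data; the library choice and all data values are assumptions. The `Agg` backend line only makes the sketch run without a display and can be dropped inside a notebook. The geolocation map is left out because it needs an additional mapping library.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the synthetic data is repeatable
x = np.linspace(0, 10, 50)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# 5.1. X-Y plot
axes[0, 0].plot(x, np.sin(x))
axes[0, 0].set_title("X-Y plot")

# 5.2. Bar chart
axes[0, 1].bar(["a", "b", "c"], [3, 5, 2])
axes[0, 1].set_title("Bar chart")

# 5.3. Histogram of normally distributed samples
axes[1, 0].hist(rng.normal(size=500), bins=20)
axes[1, 0].set_title("Histogram")

# 5.4. Heatmap of a random 5x5 matrix
image = axes[1, 1].imshow(rng.random((5, 5)), cmap="viridis")
axes[1, 1].set_title("Heatmap")
fig.colorbar(image, ax=axes[1, 1])

fig.tight_layout()
```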

## 6. Model Fitting Examples

* 6.1. Linear Regression (randomly generated sample data)

* 6.2. k-Means (Iris dataset)

* 6.3. PCA on an image
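
The fitting cells are not included in this outline. As a minimal sketch of item 6.1, the example below fits a line to randomly generated data using NumPy's least-squares solver. The true slope and intercept, the noise level, and the use of NumPy rather than another modelling library are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sample data is repeatable

# Randomly generated sample data around the line y = 2x + 1 (assumed values).
true_slope, true_intercept = 2.0, 1.0
x = rng.uniform(0, 10, size=200)
y = true_slope * x + true_intercept + rng.normal(scale=0.5, size=200)

# Design matrix: one column for x, one column of ones for the intercept.
X = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: find the coefficients minimizing ||X @ coef - y||^2.
(slope, intercept), *_ = np.linalg.lstsq(X, y, rcond=None)

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The estimates should land close to the assumed true values, with the gap shrinking as the sample size grows or the noise decreases.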