<p><a name="sections"></a></p>


# Sections

- <a href="#simple">Simple Values and Expressions</a><br>
- <a href="#dataTypes">Python Data Types</a><br>
- <a href="#lambda">Lambda Functions and Named Functions</a><br>
- <a href="#loops">Loops in Python</a><br>
- <a href="#ifelse">If-Else Statements</a><br>
- <a href="#numpypandas">Numpy and Pandas Packages</a><br>
- <a href="#joins">Pandas Join-Functionality</a><br>
- <a href="#writefile">Writing to File</a><br>


<p><a name="simple"></a></p>
# Simple Values and Expressions

**Expressions and Values**

We demonstrate the simplest expression syntax: Operators +, -, * and / work just like in most other languages; parentheses can be used for grouping.

In [483]:
# Comments in Python start with the hash character, #, and extend to the end of the physical line

1 + 2 * 3     # * has precedence over +

7

- An IPython notebook displays only the value of the last expression in each cell. To inspect all of them, we can use the `print` function:

In [484]:
print(1 + 2 * 3)
print((1 + 2) * 3)

7
9


In [96]:
print(23 / 5)     
print(23 / 3.0)   # In Python 2, int / int truncated, so a float operand was needed to get a float result
print(23 // 5.0 ) # explicit floor division (the result is a float here because 5.0 is a float)
print(23 % 5)     # remainder
print(2 ** 7)     # 2 to the power of 7

4.6
7.666666666666667
4.0
3
128


**Syntactic note**: Python does not end its statements with a special character (unlike C++, which uses a semicolon). To split a statement across several lines, either wrap the expression in parentheses or end the line with a backslash, **'\'**.

In [137]:
print(11 * (5 * 3 - 5) + 4 / 3 ** 2 - 1)

print (8 + (7 + 6 * 5)    # use parentheses
       + 4 / 3 ** 2 - 1)

print(8 * (7 + 6 - 5) \
      + 4 / 3 ** 2 - 1)       # use backslash

109.44444444444444
44.44444444444444
63.44444444444444


- **Note**: Indentation is how Python delimits blocks such as loops, functions and if-statements. Incorrectly indented code either raises an IndentationError or does something other than what you intended.

**Variables**

Variables bind names to values. We use the "=" sign to make this assignment. This is in contrast to how we test whether two things are equal, where we use the double "==".

In [140]:
tax = 8.5 / 100   # An “assignment statement”
price = 100.50

The IPython notebook prints nothing for assignments, as we see above.

In [141]:
price * tax

8.5425

Test if these two expressions are equal:

In [99]:
a = 4
b = 5
c = 4

print(a == b)
print(a == c)

False
True


**Built-In Python Functions**

- Python comes with a number of built-in functions

In [142]:
#absolute value
print(abs(-12.0), '\n')
#length of a list (which we'll cover shortly)
print(len([1, 2, 3, 4, 5]))
#Set of unique values in a list 
print(set([1, 2, 3, 4, 5, 5, 4]))

12.0 

5
{1, 2, 3, 4, 5}


You can also import **modules**, which are packages that provide extra functionality

- You use the functions in a module by importing the module and using its name plus the function’s name:

In [146]:
import math          # import the math module
print(math.sqrt(720), '\n')    # square root of 720
print(math.pow(5, 2), '\n')    #5 to the power 2

26.832815729997478 

25.0 



- Or, use a different import syntax and use the function name alone:

In [150]:
from math import factorial, pow
print(factorial(6), '\n')       #We no longer need to state the package name prior to declaring the function
print(pow(5, 2), '\n')

720 

25.0 



**Naming convention**

- Python variable names are case sensitive (VaRiAbLe_NaMe does not equal variable_name)

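- They can contain letters, digits and underscores [A-Za-z0-9_], but cannot start with a digit

- Reserved words (keywords) such as if, for and class cannot be used as variable names. Built-in names such as dict, set and list can technically be reassigned, but doing so hides the built-in, so avoid it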

- By convention, Python variables usually start with lower-case letters.  Variables should have descriptive names; for multi-word names, separate the words by underscores.
 - Good names:  column_1, row_2, key_variable, first_name
 - One-letter names are used in certain circumstances - e.g. i, j, k when used as indexes - but are otherwise frowned upon.
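
A minimal sketch of how to check whether a name is a true reserved word, using the standard-library `keyword` module. The list of names tested here is chosen purely for illustration.

```python
import keyword

# Keywords can never be used as variable names.
print(keyword.iskeyword('for'))     # True
# 'list' is a built-in name, not a keyword, so Python would let you reassign it
# (which is legal but hides the built-in, so it is best avoided).
print(keyword.iskeyword('list'))    # False

# Check several names at once.
is_keyword = {name: keyword.iskeyword(name)
              for name in ['if', 'class', 'dict', 'set', 'list']}
print(is_keyword)

# Names are case sensitive: these are two different variables.
first_name = 'Ada'
First_Name = 'Grace'
print(first_name == First_Name)     # False
```

`keyword.kwlist` holds the full list of reserved words for the Python version you are running.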

**Exercise 1**

In the input panel, run Python commands:

- Calculate 10 / 3. 
- Calculate 10 / -3  and compare it with the previous result. What do you notice?
- Calculate 2 to the power of 3.
- Now import the math module and try calculating a square root. Use the function math.sqrt(number). Try any number!


<p><a name="dataTypes"></a></p>
# Python Data Types

**Data Types**

- Python has a number of built-in data types
- Most of these you'll remember from basic programming classes
- A couple of them we'll touch base on later in the tutorial (as they require specific packages)
- We'll demonstrate all of them!
- The first ones we'll cover are **bools**, **ints** and **floats**

In [151]:
a = 5
b = 2.333
c = True

print(type(a))
print(type(b))
print(type(c))

<class 'int'>
<class 'float'>
<class 'bool'>


In [152]:
#Let's do some operations with these. First we'll add an int + float and see that we have <class 'float'>
d = a + b

print(type(d))

<class 'float'>


In [153]:
#Now let's try adding a bool + float
e = c + b

#Python treats True as 1 (and False as 0) in arithmetic, hence we can add these!
print(e)

3.333
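
Because `True` behaves as 1 and `False` as 0, summing a list of bools counts the `True` values. The sketch below illustrates this with a made-up list of flags.

```python
flags = [True, False, True, True]   # hypothetical example data

# sum() adds 1 for every True and 0 for every False.
true_count = sum(flags)
print(true_count)

# bool is a subclass of int, which is why the arithmetic works.
print(isinstance(True, int))

# Mixing an int and a float promotes the result to float.
mixed = 5 + 2.5
print(type(mixed))
```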


**Data Types**

- Next we'll look at **lists**, **sets** and **tuples**
- lists are the most commonly used data type in Python
- lists and sets are **mutable** objects 
- **mutable** means the content can be changed after they are created
- **tuples** are immutable

In [154]:
#A list is an ordered collection of objects, which DO NOT have to be of the same data type
list_ = [1, 2, 3, 5.40, 'a', 2, 2]
print(list_)

#A tuple is an immutable ordered sequence of values; this one holds a pair
tuple_ = (1, 2)
print(tuple_)

#A set is an unordered collection of unique values; here we build one from the list
set_ = set(list_)
print(set_)

[1, 2, 3, 5.4, 'a', 2, 2]
(1, 2)
{1, 2, 3, 5.4, 'a'}


In [155]:
#Here's how we subset lists
print(list_[0])
print(list_[0:3])
print(tuple_[0])

#Sets are unordered, hence we can't take slices like in a list
print(set_[3])

1
[1, 2, 3]
1


TypeError: 'set' object does not support indexing

In [156]:
#Lists are mutable. This means the object can be changed after it's been created.
list_[0] = 5
print(list_)

#Tuples are immutable
tuple_[0] = 5

[5, 2, 3, 5.4, 'a', 2, 2]


TypeError: 'tuple' object does not support item assignment

In [170]:
#You can add elements to a list by using the 'append()' method
list_a = []
list_a.append(1)
list_a.append('a')
list_a.append(2)
list_a.append('b')

print(list_a)

[1, 'a', 2, 'b']


**Data Types**

- Next we'll look at **dictionaries** and **strings**
- **dictionaries** are the Python implementation of hash tables
- they require key-value pairs
- The keys must be hashable, which in practice means immutable data types (ints, floats, strings, tuples)
- The values can be anything (ints, strings, lists, other dictionaries)
- strings have their own functionality which could be covered in a completely separate session
- We'll show just a few basics right now

In [171]:
dict_ = {'a': 1, 'b': [2, 3, 4], 'c': 3, 'd': 4}
print(dict_)
#Look up a key to see the value associated with it, here the 'a' and 'b' keys
print(dict_['a'])
print(dict_['b'])

{'a': 1, 'b': [2, 3, 4], 'c': 3, 'd': 4}
1
[2, 3, 4]
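
Dictionaries are mutable too. The following sketch, built on made-up example data, shows adding and updating entries, a safe lookup with `.get()`, a tuple used as a key, and iteration over key-value pairs.

```python
grades = {'math': 90, 'art': 85}      # hypothetical example data

grades['music'] = 77                  # add a new key-value pair
grades['art'] = 88                    # overwrite an existing value

# .get() returns a default instead of raising KeyError for a missing key.
history_grade = grades.get('history', 'not taken')
print(history_grade)

# Tuples are hashable, so they can serve as keys; lists cannot.
points = {(0, 0): 'origin', (1, 0): 'unit x'}
print(points[(0, 0)])

# Iterate over key-value pairs.
for subject, score in grades.items():
    print(subject, score)
```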


In [172]:
#Strings have a lot of built in functionality
string_a = 'Quick brown fox jumped over the lazy dogs '
string_b = 'and ran away from the hunter'
print(string_a.lower()) #convert all characters to lower case
print(string_a.upper()) #convert all characters to upper case

print(string_a + string_b, '\n') #strings can be added with an '+' operator

split_string = string_a.split() #strings can be split by calling the 'split' function

#Here you see we can both split, and then re-join the strings!
print(split_string, '\n')
print("".join(split_string))

quick brown fox jumped over the lazy dogs 
QUICK BROWN FOX JUMPED OVER THE LAZY DOGS 
Quick brown fox jumped over the lazy dogs and ran away from the hunter 

['Quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dogs'] 

Quickbrownfoxjumpedoverthelazydogs


String functionality could be an entire separate class, but for now please visit https://www.tutorialspoint.com/python/python_strings.htm for your questions on strings!

Having reviewed the basics of **Python operations** and **data types**, we'll now take a look at how to **declare functions**

**Exercise 2**

Let's reverse the words in a string!:

- String is, "Welcome to grandma's 80th Birthday Celebration!"
- First, separate the words using the str.split() function
- Next, reverse the resulting list using the list index operator
- Finally, join your new list together separating the words with a blank space
- And print...


In [173]:
#Answer
s = "Welcome to grandma's 80th Birthday Celebration!"
splits = s.split()
rev = splits[::-1]
result = " ".join(rev)
print(result)

Celebration! Birthday 80th grandma's to Welcome


<p><a name="lambda"></a></p>
# Lambda Functions and Named Functions

**Defining Functions**

- There are two ways to define functions in Python
- The first is the traditional named-function format
- The second is as a one-line anonymous function
- We'll demonstrate both!

In [174]:
#The first way is using def func_name(inputs):
def take_power(a, b):
    return a**b

print(take_power(2, 3))

8


In [175]:
#The other way to declare this function is using the "lambda" method
g = lambda a, b: a**b
g(2, 3)

8

In [176]:
#Lambda functions are often wrapped in other Python functionality, like "map", which applies a function to each value in a sequence
x = map(lambda a: a**a, [1, 2, 3, 4, 5, 6])
print(list(x))

[1, 4, 27, 256, 3125, 46656]
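
Besides `map`, lambdas are commonly passed to the built-ins `sorted` (through its `key` argument) and `filter`. A short sketch with made-up data:

```python
words = ['banana', 'fig', 'cherry', 'kiwi']   # hypothetical example data

# sorted() accepts a 'key' function; a lambda is a compact way to supply one.
by_length = sorted(words, key=lambda w: len(w))
print(by_length)

# filter() keeps only the items for which the function returns True.
# Like map(), it is lazy in Python 3, so wrap it in list() to see the values.
long_words = list(filter(lambda w: len(w) > 4, words))
print(long_words)
```

`sorted` is stable, so words of equal length keep their original relative order.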


**List Comprehensions**

- These act as in-line for-loops
- While much of the functionality is similar, List Comprehensions sometimes provide a more readable format
- Due to their one-line syntax, they are easily incorporated in functions 

In [177]:
#Let's look at an example of a list-comprehension
w = [5, 5, 5, 5, 5, 5, 5 , 5, 5]
x = [1, 2, 3, 4, 5, 6, 7 , 8, 9]

#Example 1
print([i for i in w])

#Example 2 incorporating an 'if' statement
[i if i % 2 == 0 else -9999 for i in x]

[5, 5, 5, 5, 5, 5, 5, 5, 5]


[-9999, 2, -9999, 4, -9999, 6, -9999, 8, -9999]
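
The position of the `if` matters. The sketch below contrasts the conditional expression used above, which keeps every item, with a trailing `if`, which filters items out. It also shows that the same syntax builds dictionaries.

```python
x = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# 'if ... else' BEFORE 'for' transforms every item (same length as the input).
replaced = [i if i % 2 == 0 else -9999 for i in x]
print(len(replaced))

# 'if' AFTER 'for' filters items out (the result may be shorter).
evens = [i for i in x if i % 2 == 0]
print(evens)

# The same syntax with braces builds a dictionary.
squares = {i: i ** 2 for i in evens}
print(squares)
```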

**Exercise 3**

Let's try something a little tricky.

- Create a function that takes in two lists, combines them into a new list, sorts it from biggest to smallest, and adds a 0 to the end
- Use the list.append() function in your answer


In [178]:
def list_fun(a,b):
    new_list = a+b
    new_list.sort(reverse=True)
    new_list.append(0) # the append function adds the object in parentheses to the end of the list
    return new_list
    
list_fun([1,5,4], [22, 2, 222])

[222, 22, 5, 4, 2, 1, 0]

<p><a name="loops"></a></p>
# Loops in Python

**How to Define Loops**

- Python has no terminating characters
- This means that all 'termination' is done by indentation
- In Python, we can loop through actual values of an object, OR integers
- We'll demonstrate both!

In [167]:
#Let's continue with the example from above (list comprehensions)
for i in w:
    print(i, end=" ")

5 5 5 5 5 5 5 5 5 

In [168]:
#We see here that we've printed the actual values in our list 'w'. Now let's loop through a range of integers
#The 'range' function is built into Python. In Python 3 it returns a lazy 'range' object that produces the integers in the range passed to it
#To display all the values at once, wrap the 'range' object in list()

r1 = range(10)
r2 = range(20, 30)
r3 = range(100, 90, -1)

print(list(r1), '\n')
print(list(r2), '\n')
print(list(r3), '\n')

#We see here that we are now iterating over the length of the 'w' list, as opposed to the actual values. It doesn't make much 
#of a difference here, but when we need to pass column values in dataframes, or string values to dictionaries, it's quite helpful
for i in range(len(w)):
    print(i, end=" ")

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 

[20, 21, 22, 23, 24, 25, 26, 27, 28, 29] 

[100, 99, 98, 97, 96, 95, 94, 93, 92, 91] 

0 1 2 3 4 5 6 7 8 
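
When both the position and the value are needed, the built-ins `enumerate` and `zip` are a common alternative to `range(len(...))`. A minimal sketch, with made-up scores:

```python
names = ['Kelly', 'Mike', 'Pete']
scores = [91, 78, 85]                 # hypothetical example data

# enumerate() yields (index, value) pairs.
for i, name in enumerate(names):
    print(i, name)

# zip() walks several sequences in parallel.
paired = []
for name, score in zip(names, scores):
    paired.append((name, score))
print(paired)
```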

**Exercise 4**

Looping:

Let's try writing a loop, first the standard way, and after, the same loop with a lambda function.

- Declare both a list of integers (my_list) and an empty list (new_list)
- Write a loop that multiplies each item in the list 'my_list' by 2, and adds it to 'new_list'
- print the 'new_list'

In [181]:
my_list = [2, 5, 12, 3.0, 11]
new_list = []
for item in my_list:
    times_two = item*2
    new_list.append(times_two)
print(new_list)

[4, 10, 24, 6.0, 22]


In [183]:
my_list = [2, 5, 12, 3.0, 11]
new_list = map(lambda x: x*2, my_list)
print(list(new_list))   #map returns a lazy iterator, so convert it to a list to print it

[4, 10, 24, 6.0, 22]


<p><a name="ifelse"></a></p>
# If-Else Statements in Python

**How to Write an If-Else Statement**

- The general syntax is straightforward
- Make sure the indenting is correct

In [184]:
#This is the general construction of an if-else statement
from random import randint
y = randint(0, 10)

if y < 3:
    print('y is less than 3!: ', y)
elif y >= 3 and y < 7:
    print('y is between 3 & 6!: ', y)
else:
    print('y is greater than 6!: ', y)

y is less than 3!:  1


**Python has another set of functionality called "Try-Except"**

**This is commonly used in loops and if-else statements, and is quite handy!**

- Try-except statements allow us to 'try' to perform an operation
- if we fail to complete this operation, we simply proceed to the next statement without crashing our code
- this is especially handy when parsing websites, where data may be delayed in loading
- 'except' statements can specify a number of specific errors which may occur
- once these errors have occurred, specific corrective action can then be taken 
- for a complete list of exceptions and how to handle them, please visit https://wiki.python.org/moin/HandlingExceptions !
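
As a sketch of the points above, the hypothetical helper `safe_divide` below handles two specific exception types separately and also uses the optional `else` and `finally` clauses.

```python
def safe_divide(x, y):
    """Hypothetical helper: return x / y, or None if the division fails."""
    try:
        result = x / y
    except ZeroDivisionError:
        print('Dividing by zero')
        return None
    except TypeError:
        print('Both arguments must be numbers')
        return None
    else:
        # Runs only when no exception was raised.
        return result
    finally:
        # Runs in every case, even when a return was hit above.
        print('Finished dividing', x, 'by', y)

print(safe_divide(10, 2))
print(safe_divide(10, 0))
print(safe_divide(10, 'a'))
```

Catching specific exception types, rather than using a bare `except:`, keeps unexpected errors visible.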

In [185]:
#For this example, we'll modify our if-else statement to include integers outside of the specified range

#Create a 'for' loop which will iterate 50 times
for i in range(50):
    #For each iteration, randomly select an x & y in the given ranges
    x = randint(0, 20)
    y = randint(0, 1)
    #try to divide x by y; this succeeds when y is non-zero
    try:
        print(x/y)
    #if y is zero, catch the error and print 'Dividing by Zero'
    except ZeroDivisionError:
        print('Dividing by Zero')
    

Dividing by Zero
Dividing by Zero
20.0
14.0
Dividing by Zero
16.0
8.0
5.0
5.0
13.0
11.0
17.0
9.0
Dividing by Zero
Dividing by Zero
10.0
Dividing by Zero
Dividing by Zero
Dividing by Zero
Dividing by Zero
Dividing by Zero
Dividing by Zero
7.0
Dividing by Zero
5.0
Dividing by Zero
Dividing by Zero
0.0
Dividing by Zero
Dividing by Zero
2.0
9.0
16.0
Dividing by Zero
Dividing by Zero
17.0
Dividing by Zero
Dividing by Zero
20.0
Dividing by Zero
Dividing by Zero
10.0
Dividing by Zero
16.0
Dividing by Zero
Dividing by Zero
Dividing by Zero
16.0
Dividing by Zero
11.0


<p><a name="numpypandas"></a></p>
# Numpy and Pandas packages

**Numpy & Pandas are two fundamental packages**

- Numpy provides a host of array functions
- Pandas is the main data frame package in Python
- Pandas dataframes are Python objects
- In the coming section we will explore some of their functionality

#### Numpy (Array & Matrix Operations)

"NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of positive integers. In NumPy dimensions are called axes. The number of axes is rank."

https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

In [2]:
import numpy as np

a = np.arange(12).reshape(3, 4)
print("Matrix a's shape: ", a.shape, '\n')
print("Matrix a's dimensions: ", a.ndim, '\n')
print("Matrix a's size: ", a.size, '\n')

Matrix a's shape:  (3, 4) 

Matrix a's dimensions:  2 

Matrix a's size:  12 



In [3]:
#Arrays are declared in the following fashion
b = np.array([2, 3, 4, 5, 6])
c = np.array([[2, 3, 4, 5, 6], [7, 8, 9, 10, 11]]) #Notice the double outside brackets
print(b, '\n')
print(c, '\n')

#Declare an array of zeros
d = np.zeros((3, 4))
print('Array of Zeroes')
print(d, '\n')

#Declare an array of ones
e = np.ones((3, 4))
print('Array of Ones')
print(e)

[2 3 4 5 6] 

[[ 2  3  4  5  6]
 [ 7  8  9 10 11]] 

Array of Zeroes
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]] 

Array of Ones
[[ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]
 [ 1.  1.  1.  1.]]


In [4]:
#Arithmetic operators on arrays apply elementwise. A new array is created and filled with the result.
print(d + 3.33, '\n') #See that 3.33 is added to each element in the array

print(e * 2, '\n') # '*' operator works elementwise

[[ 3.33  3.33  3.33  3.33]
 [ 3.33  3.33  3.33  3.33]
 [ 3.33  3.33  3.33  3.33]] 

[[ 2.  2.  2.  2.]
 [ 2.  2.  2.  2.]
 [ 2.  2.  2.  2.]] 
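
Since `*` is elementwise, a true matrix product needs the `@` operator (or `np.dot`). The sketch below contrasts the two on small made-up arrays and also shows broadcasting of a 1-D row across a 2-D array.

```python
import numpy as np

m = np.array([[1, 2], [3, 4]])        # hypothetical example data
n = np.array([[10, 20], [30, 40]])

elementwise = m * n      # multiplies matching entries
matrix_prod = m @ n      # matrix multiplication (same as np.dot(m, n))
print(elementwise)
print(matrix_prod)

# Broadcasting: the 1-D row is applied to every row of the 2-D array.
row = np.array([100, 200])
shifted = m + row
print(shifted)
```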



In [5]:
#Many unary operations, such as computing the 
#sum of all the elements in the array, are implemented as methods of the ndarray class.
print('Sum a matrix: ', e.sum(), '\n') #'e' was a 3x4 matrix of 1's
print('Max of a matrix: ', e.max(), '\n')
print('Min of a matrix: ', e.min(), '\n') #Max = Min since the matrix is only 1's

f = np.arange(4)
print('Array F', '\n')
print(f, '\n')
print("Exponent of matrix 'F': ", np.exp(f), '\n')
print("Square root of matrix 'F': ", np.sqrt(f), '\n')

Sum a matrix:  12.0 

Max of a matrix:  1.0 

Min of a matrix:  1.0 

Array F 

[0 1 2 3] 

Exponent of matrix 'F':  [  1.           2.71828183   7.3890561   20.08553692] 

Square root of matrix 'F':  [ 0.          1.          1.41421356  1.73205081] 



In [6]:
#1-dimensional arrays can be sliced and indexed, similar to a list
g = np.arange(20)
print('Array G', '\n')
print(g, '\n')
print('The 4th element of G: ', g[3], '\n')
print('The 3rd - 5th elements of G: ', g[2:5])

Array G 

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19] 

The 4th element of G:  3 

The 3rd - 5th elements of G:  [2 3 4]


Numpy arrays support far more functionality for matrix indexing and operations than shown here. For more on these data types, please reference https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
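
As a small taste of that additional functionality, here is a sketch of two-dimensional indexing, boolean masks and reductions along one axis, using a 3x4 array built with `np.arange`.

```python
import numpy as np

a = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns, values 0..11

print(a[1, 2])        # single element: row 1, column 2
print(a[:, 0])        # every row of the first column
print(a[0:2, 1:3])    # rows 0-1, columns 1-2

# Boolean mask: keep only the elements greater than 7.
big = a[a > 7]
print(big)

# Reductions can run along one axis instead of over the whole array.
col_sums = a.sum(axis=0)   # one sum per column
row_sums = a.sum(axis=1)   # one sum per row
print(col_sums, row_sums)
```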

#### Pandas (Dataframe objects)

Pandas is the primary dataframe package in Python. Dataframes are declared and treated as objects, with a host of useful functionality. We'll take a look at some of the most common uses and functions, but for a full deep dive into Pandas, please visit the Pandas documentation website:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

In [7]:
#Declaring Pandas dataframes can be done in a number of ways. We can declare dataframes from numpy arrays, 
#dictionaries, or csv/text files.
import os
import json
import pandas as pd

#Create a pandas dataframe from numpy arrays
a = np.array([[1, 2, 3, 4, 5],
             ['a', 'b', 'c', 'd', 'e'],
             [7, 8, 9, 10, 11]]).reshape(5, 3)
b = ['col_1', 'col_2', 'col_3']
c = ['val_1', 'val_2', 'val_3', 'val_4', 'val_5']

array_df = pd.DataFrame(data = a, columns = b, index =c)
print(array_df, '\n')

#Declare a Pandas dataframe from a dictionary
students = {'Kelly': ['A', 'B+', 'C-', 'A-'],
           'Mike': ['B', 'C+', 'A-', 'A-'],
           'Pete': ['C', 'D+', 'B-', 'B'],
           'Juan': ['A', 'A+', 'B-', 'B+']}
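#By default the dictionary keys become columns. For the keys to be row indices, specify orient='index'
dict_df = pd.DataFrame.from_dict(students, orient='columns')
print(dict_df, '\n')

#Declare a Pandas dataframe from a csv file. pd.DataFrame.from_csv was removed in pandas 1.0; pd.read_csv replaces it
#parse_dates=True converts the index to timestamps
csv_df = pd.read_csv('social_data.csv', index_col='timestamps', parse_dates=True)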
print(csv_df.head(), '\n')

      col_1 col_2 col_3
val_1     1     2     3
val_2     4     5     a
val_3     b     c     d
val_4     e     7     8
val_5     9    10    11 

  Juan Kelly Mike Pete
0    A     A    B    C
1   A+    B+   C+   D+
2   B-    C-   A-   B-
3   B+    A-   A-    B 

                              names   platform
timestamps                                    
2015-08-30 07:56:28     Olivia Munn    Youtube
2015-08-30 08:56:28  Vivika Salazar   Snapchat
2015-08-30 09:56:28      David Chen    Youtube
2015-08-30 10:56:28   Juan Williams  Instagram
2015-08-30 11:56:28     Olivia Munn    Twitter 



#### Pandas Dataframe Operations

Now that we've seen how to create a Pandas dataframe from several different inputs, let's take a look at some of the operations we can perform on a dataframe

In [9]:
#To check only the first few rows of a dataframe, we can call the '.head()' function
print('.head() function: ')
print(csv_df.head(), '\n')

#Or to check the last few entries, we can use the '.tail()' function
print('.tail() function: ')
print(csv_df.tail(), '\n')

#If we'd like to check and see if there are any 'NA's' in our dataframe, we can simply call the '.isnull()' function followed by 
#the '.sum()' function
print("Let's see if there are any Null values in our dataframe: ", '\n')
print(csv_df.isnull().sum(), '\n')

#Now we can check the shape (dimensions) of our dataframe
print('Check the shape of the dataframe: ')
print(csv_df.shape, 'rows x columns', '\n')

#What if we want to see the most frequent value in the 'names' column?
print("Let's look at the mode: ", csv_df['names'].mode(), '\n')

.head() function: 
                              names   platform
timestamps                                    
2015-08-30 07:56:28     Olivia Munn    Youtube
2015-08-30 08:56:28  Vivika Salazar   Snapchat
2015-08-30 09:56:28      David Chen    Youtube
2015-08-30 10:56:28   Juan Williams  Instagram
2015-08-30 11:56:28     Olivia Munn    Twitter 

.tail() function: 
                              names   platform
timestamps                                    
2017-08-09 09:52:46  Josh Escalante    Youtube
2017-08-09 10:52:47  Josh Escalante    Youtube
2017-08-09 11:52:57   Juan Williams    Twitter
2017-08-09 12:53:06   Gerald Butler  Instagram
2017-08-09 13:23:06  Vivika Salazar   Snapchat 

Let's see if there are any Null values in our dataframe:  

names       0
platform    0
dtype: int64 

Check the shape of the dataframe: 
(17006, 2) rows x columns 

Let's look at the mode:  0    Olivia Munn
dtype: object 



#### Pandas subsetting operations 

Pandas dataframes can be subset similar to R dataframes (if you're familiar with R). The syntax is straightforward and intuitive. We'll now look at some examples

In [10]:
#Let's select the 'platform' column of the csv_df dataframe
print(csv_df['platform'].head(), '\n')  #The '.head()' is for simplicity's sake

#Now let's select the entire csv_df dataframe where 'platform' = Facebook
print(csv_df[csv_df['platform'] == 'Facebook'].head(), '\n')

#What if we want to select all the entries in the 'names' column where the person logged into 'Facebook' according to the 
#'platform' column?
print(csv_df['names'][csv_df['platform'] == 'Facebook'].head(), '\n')

#Next, let's combine two conditions - let's select all the 'names' which equal 'Olivia Munn' and all the 'platform' values
#that equal 'Instagram':

print(csv_df[(csv_df['names'] == 'Olivia Munn') & (csv_df['platform'] == 'Instagram')].head(), '\n')

#What if we want to select a specific column/rows by the place in the index (number)? For these operations, we can use '.iloc' 
#and '.loc':

#Let's select the 1st 5 'names' values. iloc is used to select values by index number
print(csv_df.iloc[0:5, 0], '\n')

#'loc' is used to select values by index name. Let's select the '2015-08-30 13:56:28' entry in our index
csv_df.loc['2015-08-30 13:56:28']
#And we see that we have all the information for this row in our index

timestamps
2015-08-30 07:56:28      Youtube
2015-08-30 08:56:28     Snapchat
2015-08-30 09:56:28      Youtube
2015-08-30 10:56:28    Instagram
2015-08-30 11:56:28      Twitter
Name: platform, dtype: object 

                              names  platform
timestamps                                   
2015-08-30 13:56:28   Gerald Butler  Facebook
2015-08-30 15:56:28  Josh Escalante  Facebook
2015-08-30 23:56:28     Olivia Munn  Facebook
2015-08-31 03:56:28  Josh Escalante  Facebook
2015-08-31 07:56:28      David Chen  Facebook 

timestamps
2015-08-30 13:56:28     Gerald Butler
2015-08-30 15:56:28    Josh Escalante
2015-08-30 23:56:28       Olivia Munn
2015-08-31 03:56:28    Josh Escalante
2015-08-31 07:56:28        David Chen
Name: names, dtype: object 

                           names   platform
timestamps                                 
2015-09-02 05:56:28  Olivia Munn  Instagram
2015-09-08 10:56:28  Olivia Munn  Instagram
2015-09-12 09:56:28  Olivia Munn  Instagram
2015-09-13 16:56:2

names       Gerald Butler
platform         Facebook
Name: 2015-08-30 13:56:28, dtype: object
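
The same selection patterns can be tried without the csv file. The sketch below uses a small made-up dataframe (its names, labels and values are hypothetical) to show boolean masks, combined conditions, and `.loc` versus `.iloc`.

```python
import pandas as pd

# Hypothetical miniature of the social-media data used above.
df = pd.DataFrame(
    {'names': ['Ann', 'Bob', 'Ann', 'Cy'],
     'platform': ['Facebook', 'Twitter', 'Twitter', 'Facebook']},
    index=['t1', 't2', 't3', 't4'])

# Boolean mask: rows where platform is Facebook.
fb = df[df['platform'] == 'Facebook']
print(fb)

# Combine conditions with & (and) or | (or); each condition needs parentheses.
ann_tw = df[(df['names'] == 'Ann') & (df['platform'] == 'Twitter')]
print(ann_tw)

# .loc selects by label, .iloc by integer position.
bob = df.loc['t2', 'names']
first_two = df.iloc[0:2, 0]
print(bob)
print(first_two)

# .loc also accepts a mask plus a column name in one step.
fb_names = df.loc[df['platform'] == 'Facebook', 'names']
print(fb_names.tolist())
```

Selecting rows and a column in a single `.loc` call avoids the chained indexing of `df['names'][mask]`.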

In [11]:
#Now let's perform some subsetting operations on the 'csv_df' dataframe
#How about we select all of the 'Josh Escalante' entries in the dataframe
print("Subset according to 'names' = 'Josh Escalante': ", '\n')
#Now we have a separate dataframe that is a subset of the original, where 'names' = 'Josh Escalante'
josh = csv_df[csv_df['names'] == 'Josh Escalante']
print(josh.head(), '\n')

#What if we want to see how many entries were recorded for each platform?
print('Most frequently visited: ', '\n')
print(csv_df.groupby(csv_df['platform']).count(), '\n')

#Let's see what the most often used platforms are. Note that this result is not displayed, because the expression is
#neither wrapped in print() nor the last expression in the cell
print('The most often used platforms are: ', '\n')
csv_df['platform'].value_counts()

#What if we want to see how many times each person used each of the platforms?
#We can do this two ways. The longer, but "subsettable" way would be like this:
print('How many times has each person used these platforms?', '\n')
for name in set(csv_df['names']):
    print(name, '\n', csv_df[csv_df['names'] == name]['platform'].value_counts(), '\n')
    
#The cleaner way would be the following:
print(csv_df.groupby(['names', 'platform'])['platform'].count())

Subset according to 'names' = 'Josh Escalante':  

                              names   platform
timestamps                                    
2015-08-30 15:56:28  Josh Escalante   Facebook
2015-08-30 17:56:28  Josh Escalante  Instagram
2015-08-31 03:56:28  Josh Escalante   Facebook
2015-08-31 12:56:28  Josh Escalante    Youtube
2015-09-01 05:56:28  Josh Escalante    Twitter 

Most frequently visited:  

           names
platform        
Facebook    3416
Instagram   3467
Snapchat    3349
Twitter     3346
Youtube     3428 

The most often used platforms are:  

How many times has each person used these platforms? 

Vivika Salazar 
 Youtube      517
Snapchat     478
Facebook     470
Instagram    469
Twitter      458
Name: platform, dtype: int64 

Josh Escalante 
 Youtube      510
Snapchat     494
Instagram    491
Twitter      468
Facebook     450
Name: platform, dtype: int64 

David Chen 
 Instagram    525
Facebook     510
Snapchat     506
Youtube      464
Twitter      455
Name: platfo

In [12]:
#Let's now answer some additional questions surrounding our data set

#1. How many days are we observing in our data set?

#Let's first add the individual days as a column to our dataframe
csv_df['date'] = csv_df.index.date

#Now let's take the length of the set of the values in this column
print(len(set(csv_df['date'])))

711


In [13]:
#2.A Which users had the most active social media sessions?
print('Answer 2.A: ', '\n')
print(csv_df['names'].value_counts(), '\n')

#2.B Which users were active the most number of days?
for i in set(csv_df['names']):
    print(i, ": ", len(set(csv_df['date'][csv_df['names']==i])))

Answer 2.A:  

Olivia Munn          2495
Gerald Butler        2469
David Chen           2460
Josh Escalante       2413
Juan Williams        2400
Vivika Salazar       2392
Teddy Bridgewater    2377
Name: names, dtype: int64 

Vivika Salazar :  691
Josh Escalante :  696
David Chen :  693
Olivia Munn :  692
Gerald Butler :  697
Juan Williams :  698
Teddy Bridgewater :  697


**Exercise 5**

Determine the following:

- Who used Facebook most frequently?
- Answer 2.B (Which users were active the most number of days?) using the "groupby" function

In [14]:
#Hint: very similar to what we did previously, I just passed an extra parameter to the filter so we only see the FB results
print('Method 1: ', '\n')
for i in set(csv_df['names']):
    print(i, ": ", csv_df['platform'][(csv_df['names']==i) & (csv_df['platform']=='Facebook')].value_counts(), '\n')
    
print('Method 2 ', '\n')
csv_df.groupby(['names', 'platform'])['date'].count()

Method 1:  

Vivika Salazar :  Facebook    470
Name: platform, dtype: int64 

Josh Escalante :  Facebook    450
Name: platform, dtype: int64 

David Chen :  Facebook    510
Name: platform, dtype: int64 

Olivia Munn :  Facebook    503
Name: platform, dtype: int64 

Gerald Butler :  Facebook    487
Name: platform, dtype: int64 

Juan Williams :  Facebook    498
Name: platform, dtype: int64 

Teddy Bridgewater :  Facebook    498
Name: platform, dtype: int64 

Method 2  



names              platform 
David Chen         Facebook     510
                   Instagram    525
                   Snapchat     506
                   Twitter      455
                   Youtube      464
Gerald Butler      Facebook     487
                   Instagram    514
                   Snapchat     473
                   Twitter      503
                   Youtube      492
Josh Escalante     Facebook     450
                   Instagram    491
                   Snapchat     494
                   Twitter      468
                   Youtube      510
Juan Williams      Facebook     498
                   Instagram    470
                   Snapchat     460
                   Twitter      484
                   Youtube      488
Olivia Munn        Facebook     503
                   Instagram    521
                   Snapchat     470
                   Twitter      519
                   Youtube      482
Teddy Bridgewater  Facebook     498
                   Instagram    477

In [15]:
#2.B
print('Method 1: ', '\n')
for i in set(csv_df['names']):
    print(i, " :", len(set(csv_df['date'][csv_df['names']==i])))
    
print('\n', 'Method 2: ')
csv_df.groupby(['names'])['date'].nunique()

Method 1:  

Vivika Salazar  : 691
Josh Escalante  : 696
David Chen  : 693
Olivia Munn  : 692
Gerald Butler  : 697
Juan Williams  : 698
Teddy Bridgewater  : 697

 Method 2: 


names
David Chen           693
Gerald Butler        697
Josh Escalante       696
Juan Williams        698
Olivia Munn          692
Teddy Bridgewater    697
Vivika Salazar       691
Name: date, dtype: int64
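
The `groupby` results above can be reshaped so the answers are easier to read off. The following sketch runs on a small made-up dataframe (all names and values are hypothetical): it counts rows per name and platform, counts distinct days per person, and picks the top Facebook user with `idxmax`.

```python
import pandas as pd

# Hypothetical miniature data set; names and values are made up.
df = pd.DataFrame({
    'names':    ['Ann', 'Ann', 'Bob', 'Ann', 'Bob', 'Bob'],
    'platform': ['Facebook', 'Twitter', 'Facebook',
                 'Facebook', 'Facebook', 'Facebook'],
    'date':     ['d1', 'd1', 'd1', 'd2', 'd1', 'd1'],
})

# Rows per (name, platform) pair, reshaped into a names x platform table.
counts = df.groupby(['names', 'platform']).size().unstack(fill_value=0)
print(counts)

# Number of distinct days on which each person was active.
active_days = df.groupby('names')['date'].nunique()
print(active_days)

# idxmax() returns the index label of the largest value in the column.
top_fb_user = counts['Facebook'].idxmax()
print(top_fb_user)
```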

<p><a name="joins"></a></p>
# Pandas-Join functionality

**Pandas has merge, join & concatenate implemented similar to SQL**

- Pandas leverages SQL-type implementation to join data sets
- These functions provide excellent capability
- Pandas joins are very versatile

Visit https://pandas.pydata.org/pandas-docs/stable/merging.html for a complete overview!

#### There are four main join-types in Pandas:

|**Pandas `how`**|  **SQL**  |**Description**|
|----------|-----------|---------------|
|**left**|**LEFT OUTER JOIN**|Use only the keys from the left data frame|
|**right**|**RIGHT OUTER JOIN**|Use only the keys from the right data frame|
|**outer**|**FULL OUTER JOIN**|Use the union of keys from both data frames|
|**inner**|**INNER JOIN**|Use the intersection of keys, i.e. only keys present in both data frames|

- These joins are all present in SQL and we'll demonstrate how they can be implemented in Python. Let's look at some examples
- Suppose one of your friends wanted to get some stats surrounding 90's basketball players, and you (being a Python whiz) decided to help
- You first want to combine the tables in different ways to demonstrate to your friend just how good your skills are!

Let's see how we can do this

In [16]:
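#pd.DataFrame.from_csv was removed in pandas 1.0; pd.read_csv with index_col=0 uses the first column as the index
table_1 = pd.read_csv('nba_1.csv', index_col=0)
table_2 = pd.read_csv('nba_2.csv', index_col=0)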


#Let's take a look at the columns names and data types
print('Table columns: ', '\n')
print(table_1.columns, '\n')
print(table_2.columns, '\n')

print('Table 1 data types: ', '\n', table_1.dtypes, '\n')
print('Table 2 data types: ', '\n', table_2.dtypes, '\n')

#Finally let's examine the first few lines of each data frame
print('Table_1: ')
print(table_1.head(), '\n')
print('Table_2', '\n')
print(table_2.head(), '\n')

Table columns:  

Index(['points', 'rebounds', 'assists'], dtype='object') 

Index(['city', 'state'], dtype='object') 

Table 1 data types:  
 points      int64
rebounds    int64
assists     int64
dtype: object 

Table 2 data types:  
 city     object
state    object
dtype: object 

Table_1: 
                 points  rebounds  assists
name                                      
Grant Hill         1640       656      492
Penny Hardaway     1804       738      410
Alonzo Mourning    1066       984      328
Charles Oakley      873       902      164
John Stockton      1211       410      902 

Table_2 

                           city state
name                                 
Alonzo Mourning           Miami    FL
Charles Oakley         New York    NY
John Stockton    Salt Lake City    UT
Karl Malone      Salt Lake City    UT
Michael Jordan          Chicago    IL 



- Now let's try joining table 1 and table 2 on the index
- The basic syntax for a 'merge' in pandas is: 

**result = pd.merge(left, right, how='outer', on=['key1', 'key2'])**

Where **'how'** can be: **outer, inner, left or right** (similar to SQL)

If you're going to join on the row index rather than on a column, replace **'on'** with **left_index=True** and/or **right_index=True** to tell pandas which index to compare. 

In the following examples we'll see that even though our two data frames have different numbers of rows, by specifying both **'right_index'** and **'left_index'** we can match each player in one table to the correct row in the other.
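
As a minimal sketch of the two ways to specify the join key, the snippet below merges two small made-up data frames (the names `stats` and `cities` and their values are hypothetical, not the NBA tables used in this tutorial), first on the index and then on an ordinary column:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
stats = pd.DataFrame({"points": [10, 20, 30]},
                     index=pd.Index(["A", "B", "C"], name="name"))
cities = pd.DataFrame({"city": ["X", "Y", "Z"]},
                      index=pd.Index(["B", "C", "D"], name="name"))

# Join on the row index of both frames
by_index = pd.merge(stats, cities, left_index=True, right_index=True)

# Same join after turning the index into a regular column called 'name'
by_column = pd.merge(stats.reset_index(), cities.reset_index(), on="name")

print(by_index)
print(by_column)
```

Both results contain the same two matching rows ("B" and "C"); the only difference is whether `name` ends up as the index or as a column.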

Let's take a look first at how to merge on the index of two dataframes:

In [17]:
#Let's join our tables
player_totals = pd.merge(table_1, table_2, left_index = True, right_index = True)
player_totals

| name | points | rebounds | assists | city | state |
|------|--------|----------|---------|------|-------|
| Alonzo Mourning | 1066 | 984 | 328 | Miami | FL |
| Charles Oakley | 873 | 902 | 164 | New York | NY |
| John Stockton | 1211 | 410 | 902 | Salt Lake City | UT |
| Karl Malone | 1524 | 1066 | 328 | Salt Lake City | UT |
| Michael Jordan | 1856 | 574 | 656 | Chicago | IL |
| Clyde Drexler | 1403 | 492 | 697 | Houston | TX |


- We see here that the two dataframes were merged using the pandas default join type, which is **"inner"**. We can tell because the result contains only the 6 players who appear in both data frames.
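
One way to check which join type produced which rows is the `indicator=True` option of `pd.merge`, which adds a `_merge` column saying whether each row came from the left frame, the right frame, or both. The sketch below uses small made-up frames (`left` and `right` are hypothetical, not the tutorial's tables):

```python
import pandas as pd

# Hypothetical toy data: 'A' is only on the left, 'D' only on the right
left = pd.DataFrame({"name": ["A", "B", "C"], "points": [10, 20, 30]})
right = pd.DataFrame({"name": ["B", "C", "D"], "city": ["X", "Y", "Z"]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    # indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
    merged = pd.merge(left, right, on="name", how=how, indicator=True)
    row_counts[how] = len(merged)
    print(how)
    print(merged, "\n")

print(row_counts)
```

With this data the inner join keeps 2 rows, the left and right joins keep 3 each, and the outer join keeps 4.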

Now let's try some of these different joins when we're not using the index to join on and see what happens:

In [22]:
#First we'll make the index a column, and use that column to perform our joins
#(run this cell only once: resetting an already-reset index adds an extra 'index' column)
table_2.reset_index(inplace=True)
table_1.reset_index(inplace=True)

player_total_right = pd.merge(table_1, table_2, on = 'name', how = "right")
player_total_left = pd.merge(table_1, table_2, on = 'name', how = "left")
player_total_inner = pd.merge(table_1, table_2, on = 'name', how = "inner")
player_total_outer = pd.merge(table_1, table_2, on = 'name', how = "outer")

print('Right Join keeps all the data in the right table, and matches the left side against the right', '\n')
print(player_total_right, '\n')
print('Left Join does the opposite - matches the right table against the left', '\n')
print(player_total_left, '\n')
print('Inner Join only takes the common keys to both', '\n')
print(player_total_inner, '\n')
print('Outer Join keeps all the rows from both tables, filling in data when it is available', '\n')
print(player_total_outer, '\n')

Right Join keeps all the data in the right table, and matches the left side against the right 

   index_x             name  points  rebounds  assists  index_y  \
0      2.0  Alonzo Mourning  1066.0     984.0    328.0        0   
1      3.0   Charles Oakley   873.0     902.0    164.0        1   
2      4.0    John Stockton  1211.0     410.0    902.0        2   
3      5.0      Karl Malone  1524.0    1066.0    328.0        3   
4      6.0   Michael Jordan  1856.0     574.0    656.0        4   
5      7.0    Clyde Drexler  1403.0     492.0    697.0        5   
6      NaN  Hakeem Olajuwan     NaN       NaN      NaN        6   
7      NaN  Mookie Blaylock     NaN       NaN      NaN        7   

             city state  
0           Miami    FL  
1        New York    NY  
2  Salt Lake City    UT  
3  Salt Lake City    UT  
4         Chicago    IL  
5         Houston    TX  
6         Houston    TX  
7         Atlanta    GA   

Left Join does the opposite - matches the right table against th
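
A note on the `index_x` and `index_y` columns in the right-join output above: they are consistent with `reset_index` having been applied twice to each table (for example by re-running the cell), although the original run cannot be confirmed from the output alone. The sketch below reproduces the effect on made-up data (`df` and `other` are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame indexed by 'name'
df = pd.DataFrame({"points": [10, 20]},
                  index=pd.Index(["A", "B"], name="name"))

once = df.reset_index()      # 'name' becomes an ordinary column
twice = once.reset_index()   # the 0..n-1 row index becomes a column called 'index'

once_columns = list(once.columns)
twice_columns = list(twice.columns)
print(once_columns)
print(twice_columns)

# A second frame that has also been reset one time too many
other = pd.DataFrame({"name": ["B", "C"], "city": ["X", "Y"]}).reset_index()

# Both frames have an 'index' column, so merge adds the suffixes _x and _y
merged = pd.merge(twice, other, on="name", how="right")
merged_columns = list(merged.columns)
print(merged_columns)
```

Using the non-inplace form (`table = table.reset_index()` on a freshly loaded frame) or dropping the extra column avoids the problem. The float values such as `1066.0` in the output above appear because pandas converts an integer column to float when the join introduces `NaN`.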

<p><a name="writefile"></a></p>
# Writing to File

#### Finally, how do we save our data to a file?

#### Writing to a CSV file is very simple!

All this analysis is great, but what about when we want to push a dataframe (or data) to a file? Fortunately for us, that's quite easy to do!

Let's start with writing to a CSV file. Pandas has a number of built-in methods that allow the user to write to CSV files, JSON, Excel files, and various other file types. They all follow this general syntax:  

`dataframe.to_data_type('filename.type')`

where `to_data_type` stands for a method such as `to_csv`, `to_json` or `to_excel`.

In [23]:
#Just type the dataframe name, the 'to_csv' function, then what you want to name it!
player_total_inner.to_csv('nba_df.csv')
#Now check your directory to see if it's there!
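
To confirm that a file was written correctly, you can read it straight back in. The following is a minimal sketch on made-up data; the frame `df` and the file name `example.csv` are hypothetical, and a temporary directory is used so nothing is left behind:

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"name": ["A", "B"], "points": [10, 20]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.csv")

    # index=False leaves the 0, 1, 2, ... row labels out of the file
    df.to_csv(path, index=False)

    # Read the file back in to check its contents
    restored = pd.read_csv(path)

print(restored)
```

Without `index=False`, the row labels are written as an extra first column, which then shows up as `Unnamed: 0` when the file is read back unless you pass `index_col=0` to `read_csv`.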

#### Writing to a text file

Python has handy functionality to write to text files as well. The general syntax is in the following form:

`file = open("filename", "mode")`   
where **"mode"** specifies whether to read the file, append data to it, or overwrite it.  

`a` : append data to a file (add the data to whatever is already there; the file is created if it does not exist)  
`r` : read whatever is in the file (for pulling the data into dataframes, etc); this is the default mode  
`w` : write to a file, erasing whatever was previously in it  
`r+`: open an existing file for both reading and writing, without erasing its contents first
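
The difference between the modes is easiest to see side by side. Here is a short sketch that writes to a throwaway file (the name `modes.txt` is hypothetical) in a temporary directory and records the file contents after each step:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes.txt")

    with open(path, "w") as f:    # 'w' creates the file (or empties an existing one)
        f.write("first\n")
    with open(path, "a") as f:    # 'a' adds to the end of the file
        f.write("second\n")
    with open(path, "r") as f:
        after_append = f.read()

    with open(path, "r+") as f:   # 'r+' keeps the contents and starts at the beginning
        f.write("FIRST")          # overwrites the first five characters in place
    with open(path, "r") as f:
        after_r_plus = f.read()

    with open(path, "w") as f:    # 'w' again: everything written so far is erased
        f.write("only this\n")
    with open(path, "r") as f:
        after_w = f.read()

print(repr(after_append))
print(repr(after_r_plus))
print(repr(after_w))
```
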

Let's look at some examples!

In [24]:
#Let's first open a file
file_object = open('myfile.txt', "a")
#Now we call the 'write' method to enter text in our file
file_object.write('We are saving data in our file \n')
file_object.write('We are learning Python \n')
file_object.write('This has been a great tutorial! \n')
file_object.close()


- Let's now open the file in **"r"** mode and see what we wrote!

In [25]:
#Let's open our file so we can read it!
file = open('myfile.txt', 'r')
print("file.read() '\n'", file.read(), '\n')
file.close()

#If we want every line as a list of strings, we can call the 'readlines()' method
#(the 'readline()' method, without the "s", returns a single line)
file = open('myfile.txt', 'r')
print("file.readlines()  '\n'", file.readlines(), '\n')
file.close()

#Alternatively, we can loop over our object:
file = open("myfile.txt", "r") 
print("Looping over lines '\n'")
for line in file:
    print(line)    #each line already ends in '\n', so print adds a blank line

file.close()

file.read() '
' We are saving data in our file 
We are learning Python 
This has been a great tutorial! 
 

file.readlines()  '
' ['We are saving data in our file \n', 'We are learning Python \n', 'This has been a great tutorial! \n'] 

Looping over lines '
'
We are saving data in our file 

We are learning Python 

This has been a great tutorial! 
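
The blank lines in the looping output above appear because each line read from the file still ends in `'\n'` and `print` adds a newline of its own. The sketch below (writing to a hypothetical throwaway file `lines.txt`) shows `readline()`, `readlines()` and one way to strip the trailing newline:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lines.txt")
    with open(path, "w") as f:
        f.write("line one\nline two\nline three\n")

    with open(path, "r") as f:
        first_line = f.readline()    # reads a single line, including its '\n'
        remaining = f.readlines()    # reads the rest of the file as a list of lines

    with open(path, "r") as f:
        # rstrip('\n') removes the trailing newline from each line
        stripped = [line.rstrip("\n") for line in f]

print(repr(first_line))
print(remaining)
for line in stripped:
    print(line)    # no extra blank lines this time
```
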



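- Lastly, we'll use the **"with"** statement to open our text file and write something to it. The file is closed automatically when the **"with"** block ends, so no call to `close()` is needed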

In [26]:
with open('myfile.txt', 'w') as file:
    file.write("All the text will be gone except this sentence, since I used the 'w' method")

with open('myfile.txt', 'r') as file:
    print(file.readlines())

["All the text will be gone except this sentence, since I used the 'w' method"]
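
A small sketch of why `with` is usually the safer choice: the file is closed as soon as the block ends, even if an error occurs inside it. The file name `demo.txt` and the simulated error are made up for illustration:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")

    with open(path, "w") as handle:
        handle.write("hello")
        closed_inside = handle.closed    # still open inside the block
    closed_after = handle.closed         # closed once the block ends

    try:
        with open(path, "r") as reader:
            raise ValueError("simulated problem while reading")
    except ValueError:
        # The file was closed even though the block exited with an error
        closed_after_error = reader.closed

print(closed_inside, closed_after, closed_after_error)
```
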


For a more complete synopsis of the "write" methods for Python data, take a look at this website:  
    http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python   
I've found it useful and full of handy tips!
    

#### Thank you very much for reading through this tutorial! I hope it was helpful, and that you feel as though you can begin to do some basic Python programming. Please feel free to leave feedback and let me know what I could explain better (or more in depth)!