# Pandas Tutorial -- CDS 2023


## 1. Setup


Name: Sharryl Seto

ID: 1005523

### Import

Before learning pandas, we first need to install and import it. If you use the [Anaconda distribution](https://www.anaconda.com/) on your local machine or [Google Colab](https://research.google.com/colaboratory), pandas is already available; otherwise, follow the installation instructions on the [official pandas website](https://pandas.pydata.org/docs/getting_started/install.html).

In [18]:
# Importing libraries
import numpy as np
import pandas as pd

## 2. Loading Different Data Formats Into a Pandas Data Frame




In [3]:
# given
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
cities = pd.DataFrame({ 'City name': city_names, 'Population': population })
cities['Area square miles'] = pd.Series([46.87, 176.53, 97.92])
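
The frame above is assembled from `pd.Series` objects. As a minimal sketch of two other common input formats, the same table could also be read from CSV text or built from a list of records. The names `csv_text`, `records`, `cities_from_csv`, and `cities_from_records` are illustrative and not part of the original exercise.

```python
import io

import pandas as pd

# Hypothetical CSV content holding the same three cities as above.
csv_text = """City name,Population,Area square miles
San Francisco,852469,46.87
San Jose,1015785,176.53
Sacramento,485199,97.92
"""

# pd.read_csv accepts a path or a file-like object; StringIO stands in for a file here.
cities_from_csv = pd.read_csv(io.StringIO(csv_text))

# A list of dicts (one dict per row) is another common in-memory format.
records = [
    {"City name": "San Francisco", "Population": 852469, "Area square miles": 46.87},
    {"City name": "San Jose", "Population": 1015785, "Area square miles": 176.53},
    {"City name": "Sacramento", "Population": 485199, "Area square miles": 97.92},
]
cities_from_records = pd.DataFrame(records)

print(cities_from_csv)
print(cities_from_records)
```

Both calls should produce a frame with the same three rows and three columns as `cities`.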

## 3. Data preprocessing
Data preprocessing is the process of turning raw data into clean data. It is a crucial part of data science. In this section, we first explore the data, then remove unwanted columns, remove duplicates, handle missing data, and so on. After this step, the raw data has become clean data.
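
The cleaning steps named above can be sketched on a small made-up frame. The `raw` data, its `Notes` column, and the choice of the median as a fill value are assumptions for illustration only; the right strategy depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one duplicated row, one missing value, one unwanted column.
raw = pd.DataFrame({
    "City name": ["San Francisco", "San Jose", "San Jose", "Sacramento"],
    "Population": [852469, 1015785, 1015785, np.nan],
    "Notes": ["a", "b", "b", "c"],
})

clean = (
    raw.drop(columns=["Notes"])    # remove an unwanted column
       .drop_duplicates()          # remove exact duplicate rows
       .reset_index(drop=True)     # renumber the remaining rows
)

# Fill the missing population with the column median (one of several possible strategies).
clean["Population"] = clean["Population"].fillna(clean["Population"].median())

print(clean)
```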

### 3.1 Data Exploring

#### Retrieving rows from data frame.

In [4]:
# display first 3 rows
cities.head(3)


Unnamed: 0,City name,Population,Area square miles
0,San Francisco,852469,46.87
1,San Jose,1015785,176.53
2,Sacramento,485199,97.92


#### Retrieving information about dataframe 

In [9]:
cities.dtypes.value_counts()

object     1
int64      1
float64    1
Name: count, dtype: int64

#### Display number of rows and columns. 

In [10]:
cities.shape

(3, 3)
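
As a further sketch of data exploration, `dtypes` and `describe()` give a per-column view that complements `dtypes.value_counts()` and `shape`. The frame is rebuilt here so the snippet runs on its own; the variable names `n_rows`, `n_cols`, and `summary` are illustrative.

```python
import pandas as pd

cities = pd.DataFrame({
    "City name": ["San Francisco", "San Jose", "Sacramento"],
    "Population": [852469, 1015785, 485199],
    "Area square miles": [46.87, 176.53, 97.92],
})

# dtype of each individual column
print(cities.dtypes)

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = cities.shape

# describe() summarises the numeric columns: count, mean, std, min, quartiles, max
summary = cities.describe()
print(summary)
```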

#### Adding a new column to a DataFrame



In [21]:
# Add a boolean column: True if the city is named after a saint (name contains 'San')
# and its area is greater than 50 square miles
cities['Bool'] = cities['City name'].str.contains('San') & (cities['Area square miles'] > 50)

In [22]:
cities.head(3)

Unnamed: 0,City name,Population,Area square miles,Bool
0,San Francisco,852469,46.87,False
1,San Jose,1015785,176.53,True
2,Sacramento,485199,97.92,False
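
Assigning to `cities['Bool']` modifies the frame in place. As an alternative sketch, `DataFrame.assign` returns a new frame and leaves the original untouched. The `Population density` column and the `with_density` name are hypothetical and used only for illustration.

```python
import pandas as pd

cities = pd.DataFrame({
    "City name": ["San Francisco", "San Jose", "Sacramento"],
    "Population": [852469, 1015785, 485199],
    "Area square miles": [46.87, 176.53, 97.92],
})

# assign() returns a copy with the extra column; `cities` itself is not changed.
# The dict unpacking is needed because the column name contains a space.
with_density = cities.assign(
    **{"Population density": cities["Population"] / cities["Area square miles"]}
)

print(with_density)
```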


In [2]:
# exercise 2
# Importing libraries
import numpy as np
import pandas as pd

In [3]:
# given
terms_1 = pd.Series(['this', 'is', 'a', 'sample'])
count_1 = pd.Series([1,1,2,1])
doc_1 = pd.DataFrame({ 'Term': terms_1, 'Count': count_1 })

terms_2 = pd.Series(['this', 'is', 'another', 'sample'])
count_2 = pd.Series([1,1,2,3])
doc_2 = pd.DataFrame({ 'Term': terms_2, 'Count': count_2 })


In [4]:
doc_1.head(4)

Unnamed: 0,Term,Count
0,this,1
1,is,1
2,a,2
3,sample,1


In [5]:
doc_2.head(4)

Unnamed: 0,Term,Count
0,this,1
1,is,1
2,another,2
3,sample,3


In [None]:
# Calculate the tf-idf score for each word in documents d1 and d2.
# The function to design is tfidf(w, d), where w can be any word in the vocabulary and d is either d1 or d2.
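
For reference, the definition used in this exercise (taken from the formula in the function comment further down, with $N = 2$ documents) can be written as follows. The symbols $\mathrm{count}$, $\mathrm{tf}$, $\mathrm{df}$, and $\mathrm{idf}$ are notation introduced here for readability.

$$
\mathrm{tf}(w, d) = \frac{\mathrm{count}(w, d)}{\sum_{w'} \mathrm{count}(w', d)}, \qquad
\mathrm{idf}(w) = \log_2 \frac{N}{\mathrm{df}(w)}, \qquad
\text{tf-idf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)
$$

Here $\mathrm{df}(w)$ is the number of documents that contain $w$. For example, "is" appears in both documents, so $\mathrm{idf} = \log_2(2/2) = 0$ and its score is 0 in either document. "another" appears only in d2, so $\mathrm{idf} = \log_2(2/1) = 1$ and its score in d2 equals its term frequency, $2/7 \approx 0.2857$.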

In [6]:
# Add a term frequency (tf) column: each count divided by the total count in doc_1
doc_1['Term Frequency'] = doc_1['Count']/doc_1['Count'].sum()
doc_1.head(4)

Unnamed: 0,Term,Count,Term Frequency
0,this,1,0.2
1,is,1,0.2
2,a,2,0.4
3,sample,1,0.2


In [7]:
# Add a term frequency (tf) column: each count divided by the total count in doc_2
doc_2['Term Frequency'] = doc_2['Count']/doc_2['Count'].sum()
doc_2.head(4)

Unnamed: 0,Term,Count,Term Frequency
0,this,1,0.142857
1,is,1,0.142857
2,another,2,0.285714
3,sample,3,0.428571


In [39]:
print(doc_2.loc[doc_2['Term'] == 'this', 'Term Frequency'].iloc[0] )

0.14285714285714285


In [44]:
print( np.log2(2/2))

0.0


In [15]:
print(doc_1['Term'].eq('this').any())

True


In [57]:
# tf-idf(w, d) = tf(w, d) * log2(N / df(w)), with N = 2 documents
def tf_idf_score(w, d):
    print("calculating tf-idf score for the word: '", w, "' in ", d)

    # document frequency: the number of documents that contain the word
    in_doc_1 = bool(doc_1['Term'].eq(w).any())
    in_doc_2 = bool(doc_2['Term'].eq(w).any())
    df = in_doc_1 + in_doc_2

    if df == 0:
        print("word does not appear in either document")
        return 0
    if df == 2:
        print("word appears in both documents")

    # pick the requested document; any other case falls through to the message below
    if d == 'd1' and in_doc_1:
        doc = doc_1
    elif d == 'd2' and in_doc_2:
        doc = doc_2
    else:
        print("word does not appear in this document")
        return 0

    # calculate score
    tf = doc.loc[doc['Term'] == w, 'Term Frequency'].iloc[0]
    tf_idf = tf * np.log2(2 / df)
    print(tf_idf)
    return tf_idf

In [58]:
tf_idf_score('is','d2')

calculating tf-idf score for the word: ' is ' in  d2
word appears in both documents


0.0
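
The function above scores one word at a time. As a hedged alternative sketch, the scores for every word in both documents can be computed in one pass by stacking the two frames and using `groupby(...).transform`. The names `corpus`, `n_docs`, and `scores`, and the `Doc`, `tf`, `df`, and `tf_idf` columns, are introduced here for illustration and are not part of the original exercise.

```python
import numpy as np
import pandas as pd

doc_1 = pd.DataFrame({"Term": ["this", "is", "a", "sample"], "Count": [1, 1, 2, 1]})
doc_2 = pd.DataFrame({"Term": ["this", "is", "another", "sample"], "Count": [1, 1, 2, 3]})

# Stack both documents into one long table with a document label.
corpus = pd.concat(
    [doc_1.assign(Doc="d1"), doc_2.assign(Doc="d2")], ignore_index=True
)

# Term frequency: count divided by the total count within the same document.
corpus["tf"] = corpus["Count"] / corpus.groupby("Doc")["Count"].transform("sum")

# Document frequency: number of distinct documents that contain each term.
corpus["df"] = corpus.groupby("Term")["Doc"].transform("nunique")

# tf-idf = tf * log2(N / df), with N the number of documents (2 here).
n_docs = corpus["Doc"].nunique()
corpus["tf_idf"] = corpus["tf"] * np.log2(n_docs / corpus["df"])

# Wide view: one row per term, one column per document; absent terms score 0.
scores = corpus.pivot(index="Term", columns="Doc", values="tf_idf").fillna(0.0)
print(scores)
```

With this data, only "a" (0.4 in d1) and "another" (2/7 in d2) should get a non-zero score, since every other term appears in both documents.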
