In [52]:
## Importing csv for summary of functions
import pandas as pd
p4k = pd.read_csv("p4kreviews.csv",encoding='latin1',index_col=0)

## pandas Overview 


Making the move from R to Python, I feel out of place without my familiar tidyverse of packages for data manipulation and visualization. As such, I've been spending a lot of time learning [pandas](https://pandas.pydata.org/), the most popular data analysis and manipulation tool in Python. 

This post is meant to serve both as an overview of pandas functionality and as a personal reference. To demonstrate pandas, I've chosen to use a [Kaggle dataset](https://www.kaggle.com/nolanbconaway/pitchfork-data) that compiles over 18k music reviews from the Pitchfork website. 

Below is a preview of the dataset, which includes each album's score on a 10-point scale, artist name, album name, genre, review date, and text of the review. The best column indicates whether or not the album was given a 'best new music' label.

In [53]:
p4k.head()

Unnamed: 0,album,artist,best,date,genre,review,score
1,A.M./Being There,Wilco,1,December 6 2017,Rock,Best new reissue 1 / 2 Albums Newly reissued a...,7.0
2,No Shame,Hopsin,0,December 6 2017,Rap,"On his corrosive fifth album, the rapper takes...",3.5
3,Material Control,Glassjaw,0,December 6 2017,Rock,"On their first album in 15 years, the Long Isl...",6.6
4,Weighing of the Heart,Nabihah Iqbal,0,December 6 2017,Pop/R&B,"On her debut LP, British producer Nabihah Iqba...",7.7
5,The Visitor,Neil Young / Promise of the Real,0,December 5 2017,Rock,"While still pointedly political, Neil Youngs ...",6.7




**Sections:**

    1. Series
    2. DataFrames
    3. Strings
    4. Multi-Index
    5. Group By
    6. Merging, Joining, Concatenating
    7. Dates and Times

**Some useful links:**

- [Official Pandas Documentation](https://pandas.pydata.org/)

- [Comparison with R/R libraries](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html?highlight=arrange)

- [On Method Chaining in pandas](https://towardsdatascience.com/the-unreasonable-effectiveness-of-method-chaining-in-pandas-15c2109e3c69)

### Subsetting 

**Selecting columns** by name is done by passing the quoted column name inside brackets. To select several columns, pass a list of names. 

In [90]:
p4k['album']

1                   A.M./Being There
2                           No Shame
3                   Material Control
4              Weighing of the Heart
5                        The Visitor
                    ...             
19551                           1999
19552                 Let Us Replay!
19553    Singles Breaking Up, Vol. 1
19554                    Out of Tune
19555      Left for Dead in Malaysia
Name: album, Length: 19555, dtype: object

In [91]:
p4k[['album','artist']]

Unnamed: 0,album,artist
1,A.M./Being There,Wilco
2,No Shame,Hopsin
3,Material Control,Glassjaw
4,Weighing of the Heart,Nabihah Iqbal
5,The Visitor,Neil Young / Promise of the Real
...,...,...
19551,1999,Cassius
19552,Let Us Replay!,Coldcut
19553,"Singles Breaking Up, Vol. 1",Don Caballero
19554,Out of Tune,Mojave 3


A range of columns can also be selected by label with .loc and the colon (:). The slice below keeps every row and all columns from artist onward.

In [92]:
p4k.loc[:,'artist':]

Unnamed: 0,artist,best,date,genre,review,score
1,Wilco,1,December 6 2017,Rock,Best new reissue 1 / 2 Albums Newly reissued a...,7.0
2,Hopsin,0,December 6 2017,Rap,"On his corrosive fifth album, the rapper takes...",3.5
3,Glassjaw,0,December 6 2017,Rock,"On their first album in 15 years, the Long Isl...",6.6
4,Nabihah Iqbal,0,December 6 2017,Pop/R&B,"On her debut LP, British producer Nabihah Iqba...",7.7
5,Neil Young / Promise of the Real,0,December 5 2017,Rock,"While still pointedly political, Neil Youngs ...",6.7
...,...,...,...,...,...,...
19551,Cassius,0,January 26 1999,Electronic,"Well, it's been two weeks now, and I guess it'...",4.8
19552,Coldcut,0,January 26 1999,Electronic,The marketing guys of yer average modern megac...,8.9
19553,Don Caballero,0,January 12 1999,Experimental,"Well, kids, I just went back and re-read my re...",7.2
19554,Mojave 3,0,January 12 1999,Rock,"Out of Tune is a Steve Martin album. Yes, I'll...",6.3


**Selecting rows** by position can be done with the .iloc indexer, which takes brackets (not parentheses) and can be sliced with a colon (:)


In [93]:
# Returning the first row (as a pandas series)
p4k.iloc[0]

album                                      A.M./Being There
artist                                                Wilco
best                                                      1
date                                        December 6 2017
genre                                                  Rock
review    Best new reissue 1 / 2 Albums Newly reissued a...
score                                                     7
Name: 1, dtype: object

In [94]:
# Returning the first 9 rows (as a pandas dataframe); the end of the slice is exclusive
p4k.iloc[0:9]

Unnamed: 0,album,artist,best,date,genre,review,score
1,A.M./Being There,Wilco,1,December 6 2017,Rock,Best new reissue 1 / 2 Albums Newly reissued a...,7.0
2,No Shame,Hopsin,0,December 6 2017,Rap,"On his corrosive fifth album, the rapper takes...",3.5
3,Material Control,Glassjaw,0,December 6 2017,Rock,"On their first album in 15 years, the Long Isl...",6.6
4,Weighing of the Heart,Nabihah Iqbal,0,December 6 2017,Pop/R&B,"On her debut LP, British producer Nabihah Iqba...",7.7
5,The Visitor,Neil Young / Promise of the Real,0,December 5 2017,Rock,"While still pointedly political, Neil Youngs ...",6.7
6,Perfect Angel,Minnie Riperton,1,December 5 2017,Pop/R&B,Best new reissue A deluxe reissue of Minnie Ri...,9.0
7,Everyday Is Christmas,Sia,0,December 5 2017,Pop/R&B,Sias shiny Christmas album feels inconsistent...,5.8
8,Zaytown Sorority Class of 2017,Zaytoven,0,December 5 2017,Rap,The prolific Atlanta producer enlists 17 women...,6.2
9,Songs of Experience,U2,0,December 4 2017,Rock,"Years in the making, U2s 14th studio album fi...",5.3


**Filtering** rows based on some condition

In [2]:
p4k.head()

Unnamed: 0,album,artist,best,date,genre,review,score
1,A.M./Being There,Wilco,1,December 6 2017,Rock,Best new reissue 1 / 2 Albums Newly reissued a...,7.0
2,No Shame,Hopsin,0,December 6 2017,Rap,"On his corrosive fifth album, the rapper takes...",3.5
3,Material Control,Glassjaw,0,December 6 2017,Rock,"On their first album in 15 years, the Long Isl...",6.6
4,Weighing of the Heart,Nabihah Iqbal,0,December 6 2017,Pop/R&B,"On her debut LP, British producer Nabihah Iqba...",7.7
5,The Visitor,Neil Young / Promise of the Real,0,December 5 2017,Rock,"While still pointedly political, Neil Youngs ...",6.7


In [31]:
p4k.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
best,19555.0,0.053183,0.224405,0.0,0.0,0.0,0.0,1.0
score,19555.0,7.027446,1.277544,0.0,6.5,7.3,7.8,10.0


In [33]:
reviews = p4k['review']

    

In [36]:
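ReviewWordCount = []

# The original loop failed with "AttributeError: 'float' object has no attribute 'split'",
# which suggests some reviews are missing (NaN is a float). Count those as 0 words.
for review in reviews:
    WordCount = len(review.split()) if isinstance(review, str) else 0
    ReviewWordCount.append(WordCount)
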


### 1. Series

_A note on attributes and methods:_
An attribute is a value bound to an object, while a method is a procedure or action the object can perform. Attributes are accessed without parentheses; methods require them.
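
To make the attribute/method distinction concrete, here is a minimal sketch. The scores are made-up toy values, not taken from the Pitchfork data.

```python
import pandas as pd

# Toy scores, invented for illustration
scores = pd.Series([7.0, 3.5, 6.6], name="score")

n_rows = scores.size         # attribute: no parentheses, just a stored property
average = scores.mean()      # method: the parentheses run the calculation

# Leaving the parentheses off a method returns the method object, not a result
not_called = scores.mean
print(n_rows, average, callable(not_called))
```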
    
#### Series attributes and methods, explanations where necessary:
    
   - series.head() - _first 5 rows by default_
   - series.tail() - _last 5 rows by default_
   - len(series) - _Return length of series including NA/null observations_
   - sorted(series) - _Sorts values_
   - list(series) - _Converts series to a list_
   - dict(series) - _Turns the series into a dictionary object where the existing index becomes the dictionary key_
   - min(series) - _For strings, will return first value sorted alphabetically_
   - max(series) - _For strings, will return last value sorted alphabetically_
   - series.values - _values attribute_
   - series.index - _index attribute_
   - series.dtype - _data type_
   - series.is_unique - _Returns True if every value in the series is unique_
   - series.shape - _dimensions of series/dataframe_
   - series.size - _number of elements (rows*columns)_
   - series.count() - _Returns number of non-NA/null observations_
   - series.name - _name of the series_
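   - series.sort_values(inplace=True) - _sorts values, inplace=True replaces original values with sorted ones_
   - series.sort_index(inplace=True) - _sorts index, inplace=True replaces the original order with the sorted one_
   - "Value" in series - _checks the index labels, not the values; use "Value" in series.values to search the values_
   - series[n] - _with an integer index, returns the element whose index label is n; use series.iloc[n] for the nth element by position_
   - series['index label'] - _returns element by index value name_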
   - series.sum()
   - series.mean()
   - series.std()
   - series.min()
   - series.max()
   - series.median()
   - series.mode()
   - series.describe() - _Similar to summary() in R, returns key descriptive stats_
   - series.idxmax() - _Return the row label of the maximum value._
   - series.idxmin() - _Return the row label of the minimum value._
   - series.value_counts() - _Similar to table() in base R_
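
One item in the list that is easy to trip over is the `in` operator, which looks at the index rather than the values. A small sketch with a toy series (the 1-based index mimics p4k, the values are illustrative):

```python
import pandas as pd

# Toy series with a 1-based integer index, similar in shape to p4k["artist"]
artists = pd.Series(["Wilco", "Hopsin", "Wilco"], index=[1, 2, 3], name="artist")

in_index = "Wilco" in artists             # False: `in` checks the index labels
in_values = "Wilco" in artists.values     # True: search the values explicitly
label_found = 2 in artists                # True: 2 is an index label

counts = artists.value_counts()           # frequency table, like table() in base R
most_common = counts.idxmax()             # label of the largest count
```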
    
#### Apply method - invokes a function on a series of values

   - [Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.apply.html)
   - series.apply(FUNCTION, args=(additional arguments,)) - _extra positional arguments are passed as a tuple_


#### lambda
   - [Explanation](https://stackabuse.com/lambda-functions-in-python/)
   - In Python, the lambda keyword declares an anonymous (no name) function, referred to as a "lambda function". Although they look different syntactically, lambda functions behave in the same way as regular functions declared using the def keyword.
    

#### .map()
   - series.map(arg) - _replaces each value using a dictionary, a Series, or a function_
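
A short sketch of .map() on a toy series. The lookup dictionary `short_codes` is a hypothetical example, not part of the dataset.

```python
import pandas as pd

genres = pd.Series(["Rock", "Rap", "Rock", "Jazz"], name="genre")

# Hypothetical lookup table, invented for this example
short_codes = {"Rock": "RK", "Rap": "RP"}

mapped = genres.map(short_codes)   # values missing from the dict become NaN
lengths = genres.map(len)          # a function is applied to each value
```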

In [27]:
range(12)

range(0, 12)

In [94]:
artists = p4k["artist"]
scores = p4k["score"]
p4k.head()

Unnamed: 0,album,artist,best,date,genre,review,score
1,A.M./Being There,Wilco,1,December 6 2017,Rock,Best new reissue 1 / 2 Albums Newly reissued a...,7.0
2,No Shame,Hopsin,0,December 6 2017,Rap,"On his corrosive fifth album, the rapper takes...",3.5
3,Material Control,Glassjaw,0,December 6 2017,Rock,"On their first album in 15 years, the Long Isl...",6.6
4,Weighing of the Heart,Nabihah Iqbal,0,December 6 2017,Pop/R&B,"On her debut LP, British producer Nabihah Iqba...",7.7
5,The Visitor,Neil Young / Promise of the Real,0,December 5 2017,Rock,"While still pointedly political, Neil Youngs ...",6.7


In [95]:
## Methods and Attributes

artists.count() 
artists.value_counts() 
artists.head()
artists.tail()
len(artists)
sorted(artists)  
list(artists) 
dict(artists) 
min(artists)  
max(artists) 
artists.values 
artists.index
artists.dtype
artists.is_unique 
artists.shape 
artists.size 
artists.name 
artists.sort_values()  
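artists.sort_index() 
"David Bowie" in artists.values # in operator; plain `in artists` would check the index labels instead
artists[100] # element with index label 100
#artists['David Bowie'] # only works when artist names are the index labels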
scores.sum()
scores.mean()
scores.std()
scores.min()
scores.max()
scores.median()
scores.mode()
scores.describe() 
scores.idxmax() # index label of the highest score
scores.idxmin() # index label of the lowest score

12223

In [96]:
## Apply Method - invokes a function on a series of values

# returns nth character of each artist name, with index starting at 0
def n_char(string,n):
    if len(string)<n+1:
        return ''
    else:
        return(string[n])
    
 
## Returning character from artist string at position 3: 
artists.apply(n_char, args=(3,))



1        c
2        s
3        s
4        i
5        l
        ..
19551    s
19552    d
19553     
19554    a
19555    l
Name: artist, Length: 19555, dtype: object

In [97]:
## Lambda - returns the first character of each artist name
artists.apply(lambda x: x[0])

1        W
2        H
3        G
4        N
5        N
        ..
19551    C
19552    C
19553    D
19554    M
19555    N
Name: artist, Length: 19555, dtype: object

### 2. Data Frames

#### Basic Information
   - df.shape
   - df.dtypes
   - df.columns
   - df.axes
   - df.info()
   - df.sum(axis=0) or df.sum(axis=1) - _sums down each column (0) or across each row (1)_
    
#### Selecting column(s)
    
   - df["c1"] or df.c1, df[["c1","c2"]]
    
#### Adding a new column
    
   - df["newCol"] = {value}
 
#### Broadcasting Operations
   - df["Col1"].add(5) or df["Col1"] + 5 - _adds 5 to every value; NAs stay NA unless .add() is given a fill_value_
   - df[value].mul(3) 
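
A minimal sketch of how broadcasting treats missing values, using made-up scores:

```python
import pandas as pd

scores = pd.Series([7.0, None, 6.6])   # the None is stored as NaN

plus_op = scores + 5                        # NaN stays NaN
plus_method = scores.add(5, fill_value=0)   # NaN is treated as 0 before adding
tripled = scores.mul(3)                     # same as scores * 3
```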

#### Dropping Rows with Null Values  
  
   - df.dropna() - _drops any observations with an NA value. Similar to R's complete.cases()_
   - df.dropna(how="all") - _only drops rows with all NA values_
   - df.dropna(axis=1) - _drops columns with any NA values_
    
   - df.fillna(value=0) - _fills all NA values in the dataframe with 0_
   - df["column1"].fillna(0,inplace=True) - _column by column approach_
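
The variants above can be compared side by side on a small toy frame (values invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "score": [7.0, np.nan, 6.6, np.nan],
    "best": [1.0, 0.0, np.nan, np.nan],
})

any_na_dropped = df.dropna()            # only the fully complete first row survives
all_na_dropped = df.dropna(how="all")   # only the last row (all NA) is dropped
cols_dropped = df.dropna(axis=1)        # both columns contain an NA, so none remain
filled = df.fillna(value=0)             # every NA becomes 0

# Column-by-column approach, assigning the result back
df["score"] = df["score"].fillna(0)
```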
    
    
#### Converting Types using .astype() method
   - df["Float_Score"].astype("int") - _converts FLOAT to INT. Note that there is no inplace argument, so assign the result back to keep it_
   - .astype("category") - _can be used to convert a string to an R factor-like variable. Saves space._  
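
A small sketch of both conversions on toy data. The column name Int_Score is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Float_Score": [7.0, 3.5, 6.6],
    "genre": ["Rock", "Rap", "Rock"],
})

# No inplace argument, so the converted column is assigned explicitly
df["Int_Score"] = df["Float_Score"].astype("int")   # decimals are truncated
df["genre"] = df["genre"].astype("category")        # similar to an R factor

levels = list(df["genre"].cat.categories)           # the distinct categories
```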
    
#### Sorting/Ranking Values
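   - df.sort_values(["Col1","Col2"], ascending=[True,False]) - _sorts by Col1 ascending, then Col2 descending_
   - df.rank() - _provides rankings, returned as floats because ties share an average rank by default_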
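
A quick sketch of sorting on two columns and ranking, with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "genre": ["Rock", "Rap", "Rock", "Rap"],
    "score": [7.0, 3.5, 6.6, 8.1],
})

# genre A-Z, and within each genre the highest score first
ordered = df.sort_values(["genre", "score"], ascending=[True, False])

# rank 1 = highest score; the result is a float series
ranks = df["score"].rank(ascending=False)
```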
    
#### Filtering based upon a condition
   - df["Col1"]=="Value" or df["Col1"]<=22 will return a boolean series
   - df[df["Col1"]=="Value"] will return a filtered dataset
   - Alternatively, filter1 = df["Col1"]=="Value", df[filter1]
   - Conditions can be strung together with AND (&), OR (|); wrap each condition in parentheses
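
A sketch of combined conditions on a toy frame shaped like p4k (the rows are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "genre": ["Rock", "Rap", "Rock", "Pop/R&B"],
    "score": [7.0, 3.5, 6.6, 9.0],
    "best": [1, 0, 0, 1],
})

is_rock = df["genre"] == "Rock"        # boolean series, one value per row

# Each condition needs its own parentheses because & and | bind tightly
high_rock = df[(df["genre"] == "Rock") & (df["score"] >= 7)]
rap_or_best = df[(df["genre"] == "Rap") | (df["best"] == 1)]
```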
    
#### .isin() Method

   - df["Col1"].isin(["Value1","Value2"]) can be used to filter/extract rows in a dataframe
    
#### .isnull(), .notnull() Methods
   - df["Col1"].isnull() - _produces a boolean series where Col1 value is null_
   - df["Col1"].notnull() - _produces a boolean series where Col1 value is NOT null_

#### .between() Method
   - df["Col1"].between(200,300) - _returns a boolean series of observations falling between 200 and 300, inclusive. Works on times, dates, and numerics_
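
The mask-producing methods above (.isin(), .isnull(), .between()) all return a boolean series that can be placed inside df[...]. A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "genre": ["Rock", "Rap", "Jazz", None],
    "score": [7.0, 3.5, 6.6, 9.0],
})

rock_or_rap = df[df["genre"].isin(["Rock", "Rap"])]
missing_genre = df[df["genre"].isnull()]
mid_scores = df[df["score"].between(6.6, 7.0)]   # both endpoints are included
```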
   
#### .duplicated() Method
   - df["Col1"].duplicated(keep="first") - _Identifies duplicates by returning a boolean series; nothing is removed. By default the first observation is not flagged. keep=False flags all observations that have duplicates_
   
#### .drop_duplicates() Method
   - df.drop_duplicates() - _Removes rows that are duplicated across all columns, whereas .duplicated() above only flags them._
   - df.drop_duplicates(subset=["Col1"], keep = "first") - _Can be applied to specific columns_
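
A small sketch contrasting flagging with dropping. The artist and album pairs are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "artist": ["Wilco", "Hopsin", "Wilco"],
    "album": ["Album One", "No Shame", "Album Two"],
})

later_repeats = df["artist"].duplicated(keep="first")   # flags only the second Wilco
all_repeats = df["artist"].duplicated(keep=False)       # flags both Wilco rows

# Keeps the first row seen for each artist
one_per_artist = df.drop_duplicates(subset=["artist"], keep="first")
```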
   
#### .unique() Method
   - df["Col1"].unique() - _returns an array of the unique values in one column_

#### .nunique() Method
   - df.nunique() - _counts the number of unique values in each column_
 
#### .set_index() Method
   - df.set_index("Col1") - _replaces existing index with values from a column_
   
#### .reset_index() Method
   - df.reset_index(drop=True) - _resets to the default integer index and discards the old index values_
   - df.reset_index() - _moves the index back into a regular column, getting back to the original layout_
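
A sketch of the round trip between set_index() and reset_index(), on a two-row toy frame:

```python
import pandas as pd

df = pd.DataFrame({"album": ["No Shame", "The Visitor"], "score": [3.5, 6.7]})

by_album = df.set_index("album")             # album becomes the index
restored = by_album.reset_index()            # album returns as a regular column
discarded = by_album.reset_index(drop=True)  # album values are thrown away
```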
   
#### Retrieving Rows by Index Label with .loc[]
   - .loc uses brackets, not parentheses 
   - df.loc["indexLabel"] - _Retrieves row with specific index label_
   
#### Retrieving Rows by Index Position with .iloc[]
   - df.iloc[100] - _retrieves row with specific index number(s)_
   - df.iloc[60:120] - _retrieving a range_
   - df.iloc[12,1:3] - _retrieving a certain row, multiple columns_

#### Identifying Individual cells, setting new values
   - df.iloc[0, 1] = "New Value" - _sets a single cell; use = to assign, since == only compares_
   - df.iloc[2, 0:] = "New row value" - _sets every column in one row_
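
A minimal sketch of the difference between comparing and assigning a cell, using a small numeric toy frame:

```python
import pandas as pd

df = pd.DataFrame({"score": [3.5, 6.7], "best": [0.0, 0.0]})

compared = df.iloc[0, 1] == 1.0   # comparison only: evaluates to False, df is untouched

df.iloc[0, 1] = 1.0               # assignment: row 0, column 1 ("best")
df.iloc[1, 0:] = 0.0              # assignment: every column of row 1
```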

#### Renaming Index Labels or Columns in a Dataframe
   - df.rename(columns = {"Col1" : "NewCol1", "Col2" : "NewCol2"}, inplace=True) - _renaming of columns is done with a dictionary_


#### Deleting Rows or Columns from a Dataframe
   - df.drop("Row1", axis=0) - _drops a row by name_
   - df.drop("Col1", axis=1) - _drop a column_
   - del df["Col1"] - _alternative method_
   
#### Random Samples with .sample() Method
   - df.sample(n) - _sample n random rows_
   - df.sample(frac=0.25) - _samples a random 25%_
   - df.sample(n, axis=1) - _axis=0 (the default) samples rows, axis=1 samples columns_
   
#### The .nsmallest() and .nlargest() Methods
   - df.nsmallest(n=3, columns="Col1") - _returns the rows with the 3 smallest values for Col1_
   - df.nlargest(n=3, columns="Col1") - _returns the rows with the 3 largest values for Col1_
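
A quick sketch on made-up scores; both methods return whole rows, ordered by the chosen column:

```python
import pandas as pd

df = pd.DataFrame({
    "album": ["A", "B", "C", "D"],
    "score": [7.0, 3.5, 9.0, 6.6],
})

lowest = df.nsmallest(n=2, columns="score")   # lowest score first
highest = df.nlargest(n=2, columns="score")   # highest score first
```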
   
   
#### Filtering with the .where() Method
   - df.where(df["Col1"]=="Value") - _returns the original data frame with NAs in rows that don't meet the filtering criteria_
   
#### The .query() Method
   - df.query('Col1 == "Value"') - _Similar to filter, only returns matching rows. The whole condition is passed as one string_
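
A sketch contrasting .where() and .query() on the same toy frame: the first keeps the shape and blanks out non-matching rows, the second returns only the matches.

```python
import pandas as pd

df = pd.DataFrame({
    "genre": ["Rock", "Rap", "Rock"],
    "score": [7.0, 3.5, 6.6],
})

masked = df.where(df["genre"] == "Rock")   # same shape, row 1 filled with NaN
queried = df.query('genre == "Rock"')      # only rows 0 and 2
```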
   

#### .copy() Method
   - df.copy() - _create a copy of the object's indices and data_
   
   

### 3. Strings

#### Common methods   
   - string.lower()
   - string.upper()
   - string.title() - _Capitalizes first letter of each word_
   - len(string)
   - string.strip() - _Strips white space_
   - string.lstrip() - _Strips white space on the left_
   - string.rstrip() - _Strips white space on the right_
   

#### .str.replace() method
   - "Hello world".replace("l","!") - _Two arguments: pattern, substitute. For a whole column, use df["Col1"].str.replace("l","!")_
   
   
#### Filtering with string methods   
   - df["Col1"].str.lower().str.contains("water") - _Searches Col1 for strings that contain 'water'_
   - Other alternate searches: str.startswith(), str.endswith()
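
A sketch of string filtering on a toy series with one missing value. Passing na=False is my own addition so that missing entries count as non-matches.

```python
import pandas as pd

albums = pd.Series(["Water Music", "No Shame", "Underwater", None], name="album")

# Lower-casing first makes the search case-insensitive
has_water = albums.str.lower().str.contains("water", na=False)
starts_no = albums.str.startswith("No", na=False)

matches = albums[has_water]
```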

#### Splitting strings by characters 
   - "Hello my name is Ravi".split(" ") # single arg is the delimiter/sep
   - **Expand parameter**: df[["First Name", "Last Name"]] = df["Name"].str.split(",", expand=True) - _Breaks apart Name into first and last name columns_
   - **n parameter** - _n equals the maximum number of splits_
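
A sketch of the expand and n parameters on invented names:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Neil,Young", "Minnie,Riperton"]})

# expand=True returns one column per piece, which can be assigned to new columns
df[["First Name", "Last Name"]] = df["Name"].str.split(",", expand=True)

# n=1 stops after the first split, so the remainder stays in one piece
pieces = pd.Series(["a,b,c"]).str.split(",", n=1, expand=True)
```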



### 4. Multi Index
   
#### Creating a multi-index with set_index() 
   - df.set_index(keys=["Col1","Col2"], inplace=True) - _Creates multi-level index_
    
#### The .get_level_values() Method
   - df.index.get_level_values(0) - _returns the values of one index level, chosen by position or name_

#### The .set_names() Method
   - df.index = df.index.set_names(["Name1","Name2"]) - _renames index levels_
   
#### The sort_index() Method
   - df.sort_index(ascending=[True,False]) - _sorts indexes_

#### The .transpose() method and MultiIndex on Column Level
   - dfT = df.transpose() - _transposes the DataFrame, including indexes_ 

#### The .swaplevel() Method
   - df.swaplevel() - _swaps levels of multi-index_
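
The multi-index methods covered so far can be strung together on a toy frame (rows invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "genre": ["Rock", "Rock", "Rap"],
    "artist": ["Wilco", "U2", "Hopsin"],
    "score": [7.0, 5.3, 3.5],
})

indexed = df.set_index(keys=["genre", "artist"])            # two-level index
genres = indexed.index.get_level_values("genre")            # values of one level
indexed.index = indexed.index.set_names(["Genre", "Artist"])

swapped = indexed.swaplevel()                               # Artist becomes the outer level
by_genre = indexed.sort_index(ascending=[True, False])      # Genre A-Z, Artist Z-A
```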

#### The .stack() Method
   - Similar to R's tidyr::gather()
   - [Official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html)

#### The .unstack() Method 
   - Similar to R's tidyr::spread()
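   - [Official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.unstack.html)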
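
A sketch of stack() and unstack() as a round trip, on a small numeric toy frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"score": [7.0, 3.5], "best": [1.0, 0.0]},
    index=["Wilco", "Hopsin"],
)

long = df.stack()       # column names move into an inner index level
wide = long.unstack()   # the inner level moves back out into columns
```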

#### The .pivot() Method
   - df.pivot(index="Col1", columns="Col2", values="Col3") - _Returns reshaped DataFrame organized by given index/column names_ 
   - [Official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pivot.html)

#### The .pivot_table() Method
   - df.pivot_table(index="Col1", columns="Col2", values="Col3", aggfunc="mean") - _Create a spreadsheet-style pivot table as a DataFrame_
   - [Official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.pivot_table.html)
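
A sketch of the two pivot methods on made-up rows. As I understand it, the main practical difference is that pivot() needs each index/column pair to appear once, while pivot_table() aggregates repeats.

```python
import pandas as pd

# Repeated genre/best pairs, so aggregation is needed
reviews = pd.DataFrame({
    "genre": ["Rock", "Rock", "Rap", "Rap"],
    "best": [1, 0, 0, 0],
    "score": [9.0, 6.0, 3.5, 6.5],
})
table = reviews.pivot_table(index="genre", columns="best", values="score", aggfunc="mean")

# Each artist/year pair appears once, so plain pivot works
yearly = pd.DataFrame({
    "artist": ["Wilco", "Wilco", "U2"],
    "year": [2016, 2017, 2017],
    "score": [7.5, 7.0, 5.3],
})
wide = yearly.pivot(index="artist", columns="year", values="score")
```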

#### The pd.melt() Method
   - Essentially the inverse of pivot_table: converts a wide table into a longer one
   - [Official documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html)
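
A sketch of pd.melt() on a toy wide table. The names measure and value are my own choices for the new columns.

```python
import pandas as pd

wide = pd.DataFrame({
    "artist": ["Wilco", "Hopsin"],
    "score": [7.0, 3.5],
    "best": [1.0, 0.0],
})

# One row per artist/measure pair instead of one column per measure
long = pd.melt(
    wide,
    id_vars="artist",
    value_vars=["score", "best"],
    var_name="measure",
    value_name="value",
)
```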


### 5. Group by


### 6. Merging, Joining, Concatenating
   - 
   - 
   - 
   - 
   - 

### 7. Dates and Times
   - 
   - 
   - 
   - 
   - 