# Comics Rx
## [A comic book recommendation system](https://github.com/MangrobanGit/comics_rx)
<img src="https://images.unsplash.com/photo-1514329926535-7f6dbfbfb114?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=2850&q=80" width="400" align='left'>

---

# Part 1: Get and Prep Data

## Import Arcane transaction data

### Using **Google Sheets**

The data is currently stored on Google Drive and is readable as a Google Sheet. I'm going to try using a Google Sheets API to pull it straight into Pandas, so the file doesn't have to pass through my local machine every time I need to re-import it.

In [1]:
import pandas as pd # dataframes

from gspread_pandas import Spread, Client # gsheets interaction
import gspread_pandas

In [2]:
#gspread_pandas.conf.get_config()

#### Work through `gspread_pandas` Example

In [3]:
file_name = "http://stats.idre.ucla.edu/stat/data/binary.csv"
df = pd.read_csv(file_name)

In [4]:
df.head()

Unnamed: 0,admit,gre,gpa,rank
0,0,380,3.61,3
1,1,660,3.67,3
2,1,800,4.0,1
3,1,640,3.19,4
4,0,520,2.93,4


In [5]:
spread = Spread('https://docs.google.com/spreadsheets/d/19_AEcTEwHXe7LS-U1scZIHLqP0iF7W4UYMiSlBUE5bs/edit#gid=0')

In [6]:
spread.sheets

[<Worksheet 'Sheet1' id:0>]

In [7]:
spread.url


'https://docs.google.com/spreadsheets/d/19_AEcTEwHXe7LS-U1scZIHLqP0iF7W4UYMiSlBUE5bs'

In [8]:
spread.open_sheet(0)

In [9]:
spread

<gspread_pandas.client.Spread - 'User: 'werlindo.mangrobang@gmail.com', Spread: 'supertest', Sheet: 'Sheet1''>

In [10]:
deef = spread.sheet_to_df(index=None)
deef

Unnamed: 0,field,nm
0,343,dan
1,44,phil


**Sweet!** That example worked. I was able to download a Google Sheet into this Jupyter Notebook as a `pandas` DataFrame.

Let's try to get the dataset!

**Instantiate spreadsheet object**

In [11]:
sheet_url = 'https://docs.google.com/spreadsheets/d/1IznAOevvBbV0k3OKPImUMUESSwWXoOaASngPJUepnlU'

trans = Spread(sheet_url)

**Double-check the details of the spreadsheet to make sure we are looking at the correct one.**

In [12]:
print(trans.sheets)

print(trans.url)


[<Worksheet 'Copy of Detailed Sales Report.tab' id:1615090597>]
https://docs.google.com/spreadsheets/d/1IznAOevvBbV0k3OKPImUMUESSwWXoOaASngPJUepnlU


**Open the sheet and dump into a Pandas dataframe**

In [13]:
trans.open_sheet(0)

In [14]:
trans_df = trans.sheet_to_df(index=None, start_row=3)
trans_df.head()

Unnamed: 0,;Department,Category,Item,Description,Qty Sold,Date Sold,Account #
0,Overall,,,,529020,,
1,New Comics,,,,529020,,
2,New Comics,Amaze Ink Slave Labor Graphics,DCD151935,Filler Bunny #2,1,8/14/2011 6:01:03 PM,174.0
3,New Comics,Amaze Ink Slave Labor Graphics,DCD341726,Gargoyles #6,1,6/22/2012 2:11:37 PM,593.0
4,New Comics,Amaze Ink Slave Labor Graphics,DCD416182,Royal Historian of Oz #1,1,7/21/2010 2:03:07 PM,226.0


In [15]:
trans_df.tail()

Unnamed: 0,;Department,Category,Item,Description,Qty Sold,Date Sold,Account #
494700,New Comics,Zenescope Entertainment,DCDL071490,Van Helsing Vs Robyn Hood #2 (,4,1/13/2019 1:05:52 PM,1132
494701,New Comics,Zenescope Entertainment,DCDL062795,Van Helsing Vs the Werewolf #5,4,1/13/2019 1:05:52 PM,1132
494702,New Comics,Zenescope Entertainment,DCDL062795,Van Helsing Vs the Werewolf #5,2,11/7/2018 8:32:23 PM,1132
494703,New Comics,Zenescope Entertainment,DCDL062795,Van Helsing Vs the Werewolf #5,1,2/17/2019 2:01:42 PM,1132
494704,New Comics,Zenescope Entertainment,DCDL109793,Zodiac #1 Cvr E Colapietro,1,4/8/2019 12:30:34 PM,1132


We started at row 3 because I could see that the 'actual' headers don't begin until the third row of the sheet. The first two rows of the resulting dataframe are likely just summary rows (they have no item, date, or account number); we can take care of those down below.
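To illustrate the idea of skipping rows that sit above the real header, here is a small offline analogue. It uses `pd.read_csv` with `skiprows` instead of `gspread_pandas`, and the two filler rows and the sample record are invented for the example, so treat it as a sketch rather than a picture of the actual sheet.

```python
import io

import pandas as pd

# Hypothetical stand-in for a sheet whose header sits on the third row.
# The first two rows are made-up filler, not the real report contents.
raw = (
    "filler row one,,\n"
    "filler row two,,\n"
    "Category,Item,Qty Sold\n"
    "Image Comics,ABC123,1\n"
)

# Skip the two filler rows so the third row becomes the header.
demo_df = pd.read_csv(io.StringIO(raw), skiprows=2)
print(demo_df.columns.tolist())
```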

Seems good so far. Let's list some tasks we want to accomplish, just from inspecting the rows above:

 - Standardize column headers
 - Make sure date sold is a `date`
 - Change `Account #` to a string?
 - Is `;Department` all the same value? In which case we can probably just drop it.
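
The date and account items in the list above could be handled roughly as follows. This is a minimal sketch on a two-row stand-in frame (`sample` is an illustrative name, not a variable from this notebook), and it assumes every timestamp follows the `month/day/year hour:minute:second AM/PM` pattern seen in the rows above.

```python
import pandas as pd

# Stand-in frame built from two rows shown in the output above.
sample = pd.DataFrame({
    'Date Sold': ['8/14/2011 6:01:03 PM', '6/22/2012 2:11:37 PM'],
    'Account #': [174, 593],
})

# Parse the 12-hour timestamps into real datetimes.
# Assumption: all values share this exact format.
sample['Date Sold'] = pd.to_datetime(
    sample['Date Sold'], format='%m/%d/%Y %I:%M:%S %p'
)

# Account numbers are identifiers, not quantities, so store them as strings.
sample['Account #'] = sample['Account #'].astype(str)

print(sample.dtypes)
```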

Let's get started. I will save a copy first so I won't have to keep re-importing it from Google Sheets (in case we make a mistake).

In [16]:
trans_df_orig = trans_df.copy()

**Drop rows without account numbers.**  
This should eliminate the superfluous summary rows.

In [22]:
trans_df = trans_df.loc[trans_df['Account #']!='',:].copy()
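
The filter above only catches cells that are exactly an empty string. If the import ever produced missing values or whitespace-only cells instead, a slightly more defensive variant might look like the sketch below. The `sample` frame is made up for illustration, and whether such cells actually occur in this dataset is an assumption I haven't verified.

```python
import pandas as pd

# Made-up frame: two summary-like rows with no account, two real sales.
sample = pd.DataFrame({
    'Description': ['', '', 'Filler Bunny #2', 'Gargoyles #6'],
    'Account #': ['', None, '174', ' 593 '],
})

# Normalize: treat missing values as blank and strip stray whitespace.
acct = sample['Account #'].fillna('').astype(str).str.strip()

# Keep only rows that still have an account number after normalizing.
cleaned = sample.loc[acct != ''].copy()
print(cleaned)
```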

Check the values of `;Department`

In [23]:
trans_df[';Department'].unique()

array(['New Comics'], dtype=object)

They're all the same. 

**Let's drop that `;Department` column.**

In [29]:
# Every row has the same department, so the column carries no information
trans_df = trans_df.drop(columns=[';Department'])

In [30]:
trans_df.head()

Unnamed: 0,Category,Item,Description,Qty Sold,Date Sold,Account #
2,Amaze Ink Slave Labor Graphics,DCD151935,Filler Bunny #2,1,8/14/2011 6:01:03 PM,174
3,Amaze Ink Slave Labor Graphics,DCD341726,Gargoyles #6,1,6/22/2012 2:11:37 PM,593
4,Amaze Ink Slave Labor Graphics,DCD416182,Royal Historian of Oz #1,1,7/21/2010 2:03:07 PM,226
5,Amaze Ink Slave Labor Graphics,DCD416182,Royal Historian of Oz #1,1,7/14/2010 7:49:40 PM,399
6,Amaze Ink Slave Labor Graphics,DCD416182,Royal Historian of Oz #1,1,7/19/2010 10:39:04 AM,237


The rest of the columns look useful. Now's a good time to change the column headers to a more standard format.

`Category` looks like Publisher. Let's get the lay of the land.

In [31]:
trans_df['Category'].value_counts()

Marvel Comics                     163423
DC Comics                         121173
Image Comics                       91574
Dark Horse                         26549
Other                              19597
IDW Publishing                     17725
DC Vertigo                         16195
Boom! Studios                      12164
Oni Press                           8022
D.E.                                6220
Avatar Press                        5386
Archie Comics                       1952
Zenescope Entertainment             1176
Image Topcow                        1007
Bongo Comics                         721
DC Wildstorm                         538
Red 5 Comics                         472
Fantagraphics                        332
Aspen MLT                            299
Radical Publishing                   112
Drawn & Quarterly                     44
Amaze Ink Slave Labor Graphics        11
D.D.P.                                10
Top Shelf Productions                  1
Name: Category, 

It looks like it's basically the publisher of the comic. That is probably what `Category` means within the `New Comics` value of the `;Department` column we dropped earlier.

Let's look at those headers again.

In [33]:
trans_df.head(1)

Unnamed: 0,Category,Item,Description,Qty Sold,Date Sold,Account #
2,Amaze Ink Slave Labor Graphics,DCD151935,Filler Bunny #2,1,8/14/2011 6:01:03 PM,174


**Assign new column names.**

In [34]:
# Create list of new column names
col_names = ['publisher', 'item_id', 'title_and_num', 'qty_sold', 'date_sold', 'account_num']

In [35]:
trans_df.columns = col_names

In [36]:
trans_df.head(1)

Unnamed: 0,publisher,item_id,title_and_num,qty_sold,date_sold,account_num
2,Amaze Ink Slave Labor Graphics,DCD151935,Filler Bunny #2,1,8/14/2011 6:01:03 PM,174
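
Assigning a list to `trans_df.columns` relies on the columns being in exactly the expected order. A more explicit alternative, sketched here on an empty stand-in frame (`sample` and `rename_map` are illustrative names, not from this notebook), maps each old header to its new name so a reordered or missing column can't silently get the wrong label.

```python
import pandas as pd

# Empty stand-in frame with the headers shown in the output above.
sample = pd.DataFrame(columns=[
    'Category', 'Item', 'Description', 'Qty Sold', 'Date Sold', 'Account #'
])

# Explicit old-name -> new-name mapping.
rename_map = {
    'Category': 'publisher',
    'Item': 'item_id',
    'Description': 'title_and_num',
    'Qty Sold': 'qty_sold',
    'Date Sold': 'date_sold',
    'Account #': 'account_num',
}

renamed = sample.rename(columns=rename_map)
print(renamed.columns.tolist())
```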
