In [9]:
##Imports

from pathlib import Path
import pandas as pd
import numpy as np

import cv2

## Config Variables 

DATA_DIR = Path('../data')
CONTENT_DIR = Path.home() / 'Datasets' / 'unpackAI' / 'DL201-4.0' # Change this for use in colab

startupDataPath = DATA_DIR / 'StartUpInvestments' / 'investments_VC.csv'

IMAGE_DIR = Path('../img')

In [10]:
# Loading the dataset
startUpData = pd.read_csv(startupDataPath,encoding = 'unicode_escape')

### Introduction

In this notebook, we are going to learn about indexing. Indexing is a central concept that allows both us and the computer to access data points quickly and easily. Well-indexed data is essential in deep learning: without it, we cannot even begin to access and manipulate our data.


![Library](../img/library.jpg)

If you can't access each data point, you can't expect a computer to do it either. Before an ML algorithm can find relationships in data, it needs to be able to access the data points. Otherwise, it is like looking for a receipt buried under a disheveled stack of papers.

There are several ways to organize and conceptualize data; this course focuses on indexing.

Indexing is one of the most intuitive yet powerful ways to access and manipulate data, and pandas, as a data-wrangling tool, takes it to the next level.

There is nothing new about this idea: indexes are a part of everyday life.

When you read a book, the page number is an index. 
A book also has an index called an ISBN, which allows libraries to identify and categorize books.

Postal codes are also indices, which allow for the efficient delivery of mail.

What are some other examples of indices?

# Section 1: Indexing Tabular Data

The pandas library shines in how it indexes data. Its strength is its ability to put human-readable indices on top of optimized scientific computing code.

This makes indexing simple and straightforward, which is why tabular data is covered first, before the computer vision and NLP examples.

One way pandas does this is by identifying columns by name rather than by number. Each column can then be accessed through its name.

As you'll see below, pandas column names are essentially a layer of metadata tied to each feature, which makes it very straightforward to access and manipulate the data stored under them.
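
To make the idea of names on top of numbers concrete, here is a minimal sketch. It uses a tiny made-up table rather than the startup dataset; the names raw and toy and all the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only
raw = np.array([[1, 10.0], [2, 20.0], [3, 30.0]])

# Plain NumPy: a column is addressed by its position
funding_by_position = raw[:, 1]

# pandas: the same values, but each column carries a readable name
toy = pd.DataFrame(raw, columns=['company_id', 'funding'])
funding_by_name = toy['funding']

print(funding_by_position)
print(funding_by_name.to_numpy())
```

Both lines print the same numbers; the only difference is whether the column was found by position or by name.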

![indextabs](../img/indextabs.jpg)

### Example one: Indexing Columns
<hr style="border:1px solid gray"> </hr>

In general, the first steps after loading a CSV in pandas are the following:

* First, read the data.
* Second, check the shape.
* Third, look at the columns.

Reading the data loads it into the computer's RAM so that the CPU can operate on it.

Checking the shape is a fundamental step that tells us what our data looks like and how it is going to be indexed.

Looking at the columns then tells us what the features in the data are and how they might be useful to us.

Let's take a look at this dataset, which contains data on startups: the market they are in, along with their funding and acquisition status.

In [11]:
# loading the data
startUpData = pd.read_csv(startupDataPath,encoding = 'unicode_escape')

##### Step 1: Check the shape 


Whenever you load a dataframe, the first thing to do is check its shape. It is good practice to make a note of these two numbers because they are going to appear over and over again in your code.

If either number differs from what you were expecting, it tells you right away that your data didn't load properly. The first number is the length, which is the number of samples (rows). When you remove samples from your dataset, this number changes, and code that relies on it might break. If you need to remove samples from your dataframe, it is better to do so sooner rather than later.

The second number is the width, which is the number of features (columns), and it is the most important number to pay attention to in pandas. If you get an unexpected width, your data has not loaded correctly. If you remove or add a feature, this number changes.

The width also deserves attention because it will follow you around for the rest of the project and helps you keep track of what is going on.

In [13]:
startUpData.shape
# What does this mean?

(54294, 39)
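
As a small sketch of how the two numbers react to changes, the example below uses a made-up three-row dataframe (toy is a hypothetical name, not part of this notebook's data) and removes one sample and one feature.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'name': ['A', 'B', 'C'], 'city': ['X', 'Y', 'Z']})

n_rows, n_cols = toy.shape      # (number of samples, number of features)
print(n_rows, n_cols)

fewer_rows = toy.drop(index=0)          # removing a sample changes the first number
fewer_cols = toy.drop(columns='city')   # removing a feature changes the second number
print(fewer_rows.shape, fewer_cols.shape)
```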

##### Step 2: inspect the first few rows/samples/instances

The next useful step is to take a peek at your data. Do you see numbers? Text? Categories? Time data?

People often use .head() for this, but .sample() is a good alternative because it draws random rows and might show you something you weren't expecting.

In [13]:
startUpData.sample(5)

Unnamed: 0,permalink,name,homepage_url,category_list,market,funding_total_usd,status,country_code,state_code,region,...,secondary_market,product_crowdfunding,round_A,round_B,round_C,round_D,round_E,round_F,round_G,round_H
22429,/organization/kamibu,Kamibu,http://www.kamibu.com,|Games|,Games,70020,operating,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
45076,/organization/university-of-massachusetts-medi...,University of Massachusetts Medical School,http://www.umassmed.edu,,,9500000,operating,USA,MA,Worcester,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6497,/organization/browsarity,Browsarity,http://www.browsarity.com,|Venture Capital|Online Shopping|Charity|Softw...,Software,-,operating,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6382,/organization/bright-computing,Bright Computing,http://www.brightcomputing.com,|Software|,Software,14500000,operating,USA,CA,SF Bay Area,...,0.0,0.0,0.0,14500000.0,0.0,0.0,0.0,0.0,0.0,0.0
28641,/organization/neoguide-systems,NeoGuide Systems,http://www.neoguidesystems.com,|Biotechnology|,Biotechnology,25000000,acquired,USA,CA,SF Bay Area,...,0.0,0.0,0.0,0.0,25000000.0,0.0,0.0,0.0,0.0,0.0
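
The difference between the two methods can be sketched on a made-up dataframe (toy is a hypothetical name). Passing random_state is optional; it is used here only so that the random draw can be repeated.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'value': range(100)})

first_rows = toy.head(5)                      # always the first five rows
random_rows = toy.sample(5, random_state=0)   # five rows drawn at random
same_rows = toy.sample(5, random_state=0)     # same seed, so the same rows again

print(first_rows.index.tolist())
print(random_rows.index.tolist())
```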


The columns in this dataframe are our features.

Which columns are you interested in? What do you want to know more about?

The .columns attribute gives us the names of the columns in our dataframe.

This is a good step because column names are sometimes messy and might not be exactly as they first appear.
Oftentimes, you may need to rename columns. This dataset is fairly clean, so renaming can wait until the skills book.

Still, the name of the 'market' column is not quite right. Can you find the problem? Are there any other columns like it?

In [14]:
print(startUpData.columns)

Index(['permalink', 'name', 'homepage_url', 'category_list', ' market ',
       ' funding_total_usd ', 'status', 'country_code', 'state_code', 'region',
       'city', 'funding_rounds', 'founded_at', 'founded_month',
       'founded_quarter', 'founded_year', 'first_funding_at',
       'last_funding_at', 'seed', 'venture', 'equity_crowdfunding',
       'undisclosed', 'convertible_note', 'debt_financing', 'angel', 'grant',
       'private_equity', 'post_ipo_equity', 'post_ipo_debt',
       'secondary_market', 'product_crowdfunding', 'round_A', 'round_B',
       'round_C', 'round_D', 'round_E', 'round_F', 'round_G', 'round_H'],
      dtype='object')


In [15]:
# Strip stray leading/trailing whitespace from the column names
startUpData.columns = startUpData.columns.str.strip()
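
One possible way to spot such column names before cleaning them is sketched below on a made-up dataframe. The names toy and messy are assumptions for illustration.

```python
import pandas as pd

# Made-up data with one deliberately messy column name
toy = pd.DataFrame({' market ': ['Games'], 'status': ['operating']})

# A name that changes when stripped contains stray whitespace
messy = [col for col in toy.columns if col != col.strip()]
print(messy)

toy.columns = toy.columns.str.strip()
print(toy.columns.tolist())
```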

We can access a column in our dataframe by passing its name inside square brackets

startUpData['name']

If the name is a valid Python identifier (no spaces, for example) and does not clash with an existing dataframe attribute or method, we can also access it with a period

startUpData.name

In [None]:
startUpData['name'] #startUpData.name

We can also select multiple columns by passing a list of column names



In [None]:
startUpData[['name','city']].head()
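
A detail worth knowing is that the kind of brackets changes the type of the result. The sketch below uses a made-up dataframe (toy is a hypothetical name) to show it.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'name': ['A', 'B'], 'city': ['X', 'Y'], 'seed': [1.0, 2.0]})

one_column = toy['name']              # single label: returns a Series
as_frame = toy[['name']]              # list with one label: returns a DataFrame
two_columns = toy[['name', 'city']]   # list with two labels: returns a DataFrame

print(type(one_column).__name__, one_column.shape)
print(type(as_frame).__name__, as_frame.shape)
print(two_columns.shape)
```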

### Example two: Indexing Rows 
<hr style="border:1px solid gray"> </hr>

![TensorIndex](../img/1_3DTensor.jpg)
Source: https://www.surajx.in/

Congratulations, we now know how to index the columns of tabular data.


#### Indexing a single row

Now, let's try to get rows. In AI fields, rows are referred to as samples or instances.

This can be done with the .loc and .iloc indexers: .loc selects by index label, while .iloc selects by integer position.

If we want to access exactly one sample, we can pass its position to .iloc inside square brackets

In [17]:
#startUpData.iloc[0]
startUpData.iloc[4096]

permalink               /organization/axialmarket
name                                        Axial
homepage_url                 http://www.axial.net
category_list                          |Software|
 market                                 Software 
 funding_total_usd                     85,00,000 
status                                  operating
country_code                                  USA
state_code                                     NY
region                              New York City
city                                     New York
funding_rounds                                3.0
founded_at                             2009-01-01
founded_month                             2009-01
founded_quarter                           2009-Q1
founded_year                               2009.0
first_funding_at                       2010-11-05
last_funding_at                        2014-08-12
seed                                    2000000.0
venture                                 6500000.0
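
In the startup dataframe the index labels happen to be the same as the row positions, so .loc and .iloc look identical. A minimal sketch with a made-up dataframe whose labels are 10, 20, and 30 (an assumption for illustration) shows how they differ.

```python
import pandas as pd

# Made-up data with index labels that differ from the row positions
toy = pd.DataFrame({'name': ['A', 'B', 'C']}, index=[10, 20, 30])

by_position = toy.iloc[0]   # first row, whatever its label is
by_label = toy.loc[10]      # row whose index label is 10

print(by_position['name'], by_label['name'])
```

Here toy.iloc[10] would raise an IndexError because there are only three rows, while toy.loc[0] would raise a KeyError because no row is labelled 0.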


#### Indexing Multiple rows

Since data in Vector, Series, or Matrix form consists of more than one value organized together, we need to be able to access multiple rows.

We can do that by selecting a range of positions.

The number before the colon (:) is the lower bound, which is included.
The number after the colon is the upper bound, which is excluded, so 0:5 returns rows 0 through 4.

In [18]:
startUpData.iloc[0:5]

Unnamed: 0,permalink,name,homepage_url,category_list,market,funding_total_usd,status,country_code,state_code,region,...,secondary_market,product_crowdfunding,round_A,round_B,round_C,round_D,round_E,round_F,round_G,round_H
0,/organization/waywire,#waywire,http://www.waywire.com,|Entertainment|Politics|Social Media|News|,News,1750000,acquired,USA,NY,New York City,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,/organization/tv-communications,&TV Communications,http://enjoyandtv.com,|Games|,Games,4000000,operating,USA,CA,Los Angeles,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,/organization/rock-your-paper,'Rock' Your Paper,http://www.rockyourpaper.org,|Publishing|Education|,Publishing,40000,operating,EST,,Tallinn,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,/organization/in-touch-network,(In)Touch Network,http://www.InTouchNetwork.com,|Electronics|Guides|Coffee|Restaurants|Music|i...,Electronics,1500000,operating,GBR,,London,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,/organization/r-ranch-and-mine,-R- Ranch and Mine,,|Tourism|Entertainment|Games|,Tourism,60000,operating,USA,TX,Dallas,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
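
The bounds behave differently for .iloc and .loc, which is a common source of off-by-one mistakes. The sketch below uses a made-up dataframe (toy is a hypothetical name) with the default 0, 1, 2, ... index.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({'value': list('abcdefgh')})

positional = toy.iloc[2:5]   # positions 2, 3, 4 (the upper bound 5 is excluded)
labelled = toy.loc[2:5]      # labels 2, 3, 4, 5 (the upper bound 5 is included)

print(positional['value'].tolist())
print(labelled['value'].tolist())
```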


A friendly reminder that Python starts counting at 0 instead of 1. This convention goes back to the early days of computing: an index is commonly treated as an offset from the start of the data in memory, so the first element sits at offset 0. With a single decimal digit, 1-9 covers only nine values, while 0-9 covers ten, so skipping 0 wastes one of the available values. Memory is far less of a concern now, but the convention has remained.

In [19]:
# Why is the shape of this data special and what does it tell us about the colon operator?

startUpData.iloc[:].shape

(54294, 39)

Negative numbers count from the end, so a slice such as -5: returns the last five rows of the dataframe. It is a good idea to check the end of this dataset as well, because the last rows are completely empty. They will have to be dropped later.

In [20]:
startUpData.iloc[-5:]

Unnamed: 0,permalink,name,homepage_url,category_list,market,funding_total_usd,status,country_code,state_code,region,...,secondary_market,product_crowdfunding,round_A,round_B,round_C,round_D,round_E,round_F,round_G,round_H
54289,,,,,,,,,,,...,,,,,,,,,,
54290,,,,,,,,,,,...,,,,,,,,,,
54291,,,,,,,,,,,...,,,,,,,,,,
54292,,,,,,,,,,,...,,,,,,,,,,
54293,,,,,,,,,,,...,,,,,,,,,,
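
As a rough sketch of how completely empty rows could be dropped, the example below uses a made-up dataframe (toy and cleaned are hypothetical names). It relies on dropna with how='all', which removes a row only when every value in it is missing.

```python
import numpy as np
import pandas as pd

# Made-up data: the last two rows are completely empty
toy = pd.DataFrame({
    'name': ['A', 'B', np.nan, np.nan],
    'city': ['X', np.nan, np.nan, np.nan],
})

print(toy.iloc[-2:])                 # inspect the end of the dataframe

cleaned = toy.dropna(how='all')      # row 'B' survives: only its city is missing
print(cleaned.shape)
```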


#### Accessing data by row and column

If you can access data by column and by row, you can do both at the same time.

In the example below, we select the name and region columns, then index rows out of that selection.

In [21]:
startUpData[['name','region']].iloc[0:10]

Unnamed: 0,name,region
0,#waywire,New York City
1,&TV Communications,Los Angeles
2,'Rock' Your Paper,Tallinn
3,(In)Touch Network,London
4,-R- Ranch and Mine,Dallas
5,.Club Domains,Ft. Lauderdale
6,.Fox Networks,Buenos Aires
7,0-6.com,
8,004 Technologies,"Springfield, Illinois"
9,01Games Technology,Hong Kong
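
Chaining is not the only option: .iloc and .loc also accept a row selection and a column selection in a single call. The sketch below uses a made-up dataframe (toy is a hypothetical name) and shows three equivalent ways to get the same result.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'status': ['operating', 'acquired', 'operating', 'closed'],
    'region': ['X', 'Y', 'Z', 'W'],
})

chained = toy[['name', 'region']].iloc[0:2]   # columns first, then rows
by_position = toy.iloc[0:2, [0, 2]]           # rows 0-1, columns at positions 0 and 2
by_label = toy.loc[0:1, ['name', 'region']]   # labels 0-1 (inclusive), columns by name

print(by_position)
```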


Now that we can access data by its index, we have completed the first step.

### Example 3: Querying/Searching rows in Pandas
<hr style="border:1px solid gray"> </hr>

![Database](../img/dataserver.jpg)

Querying a dataframe to extract data from it is a very useful skill. In practice, we rarely use pandas to look up specific rows by their index location, but understanding that principle lets us go further in this notebook.

Instead, we can give pandas a condition and ask it to return the samples that meet it.

This is powerful because it allows us to cut through huge amounts of data.


Here is an example that returns every instance in the dataframe where the startup's region is 'Los Angeles'

In [17]:
startUpData[startUpData['region'] == 'Los Angeles']

Unnamed: 0,permalink,name,homepage_url,category_list,market,funding_total_usd,status,country_code,state_code,region,...,secondary_market,product_crowdfunding,round_A,round_B,round_C,round_D,round_E,round_F,round_G,round_H
1,/organization/tv-communications,&TV Communications,http://enjoyandtv.com,|Games|,Games,4000000,operating,USA,CA,Los Angeles,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,/organization/1-800-dentist,1-800-DENTIST,http://www.1800dentist.com,|Health and Wellness|,Health and Wellness,-,operating,USA,CA,Los Angeles,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50,/organization/12society,12Society,http://www.12Society.com,|E-Commerce|,E-Commerce,-,acquired,USA,CA,Los Angeles,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
55,/organization/140fire,140Fire,http://140fire.com,|Entertainment|Sports|Real Time|Social Media|V...,Entertainment,500000,operating,USA,CA,Los Angeles,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
85,/organization/1rp-media,1RP Media,,,,-,operating,USA,CA,Los Angeles,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49271,/organization/zomazz,Zomazz,http://www.zomazz.com,|Curated Web|,Curated Web,2040342,operating,USA,CA,Los Angeles,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49315,/organization/zoondy,Zoondy,http://zoondy.com,|Curated Web|,Curated Web,75000,operating,USA,CA,Los Angeles,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49343,/organization/zqgame,ZQGame,http://zqgame.com,|Games|,Games,4220018,operating,USA,CA,Los Angeles,...,0.0,0.0,4220018.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
49367,/organization/zuma-ventures,Zuma Ventures,http://www.zuma.ventures,|Product Development Services|Marketplaces|Tec...,Technology,100000,operating,USA,CA,Los Angeles,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Discussion Time: 

It is crucial to understand what this line of code is doing. To do so, let's go back to the indexing example and run only the code that goes inside the outer brackets



In [18]:
exampleSeries = startUpData['region'] == 'Los Angeles'
exampleSeries

0        False
1         True
2        False
3        False
4        False
         ...  
54289    False
54290    False
54291    False
54292    False
54293    False
Name: region, Length: 54294, dtype: bool

This gives us a boolean Series, often called a mask, which tells pandas which instances we want: True means keep the row, False means skip it.

As you can see, the mask and the dataframe have the same length, and are therefore compatible.



In [19]:
exampleSeries.shape


(54294,)

In [21]:

exampleSeries.shape[0] == startUpData.shape[0]

True

See what kind of information you can get out of this dataset using this technique

In [22]:
startUpData[exampleSeries].head()

Unnamed: 0,permalink,name,homepage_url,category_list,market,funding_total_usd,status,country_code,state_code,region,...,secondary_market,product_crowdfunding,round_A,round_B,round_C,round_D,round_E,round_F,round_G,round_H
1,/organization/tv-communications,&TV Communications,http://enjoyandtv.com,|Games|,Games,4000000,operating,USA,CA,Los Angeles,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,/organization/1-800-dentist,1-800-DENTIST,http://www.1800dentist.com,|Health and Wellness|,Health and Wellness,-,operating,USA,CA,Los Angeles,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50,/organization/12society,12Society,http://www.12Society.com,|E-Commerce|,E-Commerce,-,acquired,USA,CA,Los Angeles,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
55,/organization/140fire,140Fire,http://140fire.com,|Entertainment|Sports|Real Time|Social Media|V...,Entertainment,500000,operating,USA,CA,Los Angeles,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
85,/organization/1rp-media,1RP Media,,,,-,operating,USA,CA,Los Angeles,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
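
Masks can also be counted and combined. The sketch below is one way to do that on a made-up dataframe; toy, in_la, and operating are hypothetical names. Note that conditions are combined with & (and) or | (or), not with the Python keywords and/or.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'region': ['Los Angeles', 'London', 'Los Angeles', 'Dallas'],
    'status': ['operating', 'operating', 'acquired', 'closed'],
})

in_la = toy['region'] == 'Los Angeles'
operating = toy['status'] == 'operating'

print(in_la.sum())                  # True counts as 1, so this counts the matches
print(toy[in_la & operating])       # both conditions must hold
print(toy[toy['region'].isin(['London', 'Dallas'])])   # match any value in a list
```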


## Section 1: Part 2 (Indexing in Computer Vision)
<hr style="border:2px solid gray"> </hr>

### Introduction section 
<hr style="border:1px solid gray"> </hr>


Let's do some exercises to show how we can crop an image. 


### CV Indexing: example 1 (Resizing and Cropping an Image)
<hr style="border:1px solid gray"> </hr>

Indexing is relatively straightforward with images. Every pixel on your screen has an x and a y coordinate, and images work the same way. Knowing the dimensions of our image allows us to resize and crop it. This is usually done programmatically, but it is a good opportunity to play with the raw numbers to get a better understanding. Note that the shape of an image loaded with OpenCV is ordered as (height, width, channels), so the y axis comes first.

In [4]:
# Option to upload
#colab.upload()

sampleImagePath = IMAGE_DIR / 'highway.jpg'



In [5]:
img = cv2.imread(str(sampleImagePath))  # returns None (no error) if the file cannot be read

In [27]:
type(img)

numpy.ndarray

In [28]:
img.shape

(640, 960, 3)
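
Because the image is a NumPy array, cropping is just slicing. The sketch below uses a blank stand-in array with the same shape instead of the highway photo; fake_img and the crop coordinates are assumptions for illustration.

```python
import numpy as np

# Stand-in for a loaded image: 640 rows (height), 960 columns (width), 3 channels
fake_img = np.zeros((640, 960, 3), dtype=np.uint8)

height, width, channels = fake_img.shape

# Rows (y) come first, columns (x) second
crop = fake_img[100:300, 200:600]
print(crop.shape)

pixel = fake_img[10, 20]     # the pixel at y=10, x=20: one value per channel
print(pixel.shape)
```

The crop is 200 pixels high and 400 pixels wide, and it keeps all three color channels.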

In [None]:
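# cv2.imshow opens a desktop window, so it does not work in Colab.
# In Colab, use cv2_imshow from google.colab.patches instead.

cv2.imshow('Highway',img)
cv2.waitKey(0) # wait for a key press
# If the window does not close, run the next cell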

In [None]:
cv2.destroyAllWindows() #close the image window

In [6]:
img = cv2.resize(img,(0,0),fx=0.5,fy=0.5) # scales width and height by a factor of 0.5
print(img.shape)

(320, 480, 3)
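
To connect resizing back to indexing, here is a rough NumPy-only sketch that halves an image by keeping every second row and column. This is not what cv2.resize does internally (it interpolates between pixels); it only illustrates that the halved shape follows from the indices. fake_img is a made-up stand-in array.

```python
import numpy as np

# Stand-in image filled with distinct values so pixels can be told apart
fake_img = np.arange(640 * 960 * 3, dtype=np.uint32).reshape(640, 960, 3)

# Keep every second row and every second column
smaller = fake_img[::2, ::2]
print(smaller.shape)
```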


In [None]:
cv2.imshow('Highway',img)
cv2.waitKey(0)

### CV Indexing Example 2: Labeling Images with openCV
<hr style="border:1px solid gray"> </hr>


#### Creating a boxed label

In this section, we're going to add a rectangle to our image that outlines where something is located in the image.

Most often this is done by the AI model, but let's take a look at how it can be done under the hood

In [7]:
x1 = 250
y1 = 150
corner1 = (x1,y1)


x2 = 320
y2 = 220
corner2 = (x2,y2)


color = (0,0,255) #red  #OpenCV uses BGR order: Blue Green Red
frameWidth = 3 #pixels

# cv2.rectangle draws directly on img and returns the same array
labeled_img = cv2.rectangle(img,corner1,corner2,color,frameWidth)
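
To see what drawing a box means in terms of indexing, the sketch below paints a similar outline using only NumPy slices on a blank stand-in image. It is an illustration, not OpenCV's actual implementation: the outline here is drawn entirely inside the box, so its exact pixels may differ slightly from what cv2.rectangle produces. canvas and t are hypothetical names.

```python
import numpy as np

canvas = np.zeros((320, 480, 3), dtype=np.uint8)   # black stand-in image

x1, y1 = 250, 150
x2, y2 = 320, 220
color = (0, 0, 255)      # red if the channels are ordered B, G, R
t = 3                    # outline thickness in pixels

# Rows (y) are indexed first, columns (x) second
canvas[y1:y1 + t, x1:x2] = color     # top edge
canvas[y2 - t:y2, x1:x2] = color     # bottom edge
canvas[y1:y2, x1:x1 + t] = color     # left edge
canvas[y1:y2, x2 - t:x2] = color     # right edge

print(canvas[y1, x1], canvas[185, 285])   # a corner pixel and an interior pixel
```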

In [None]:
cv2.imshow('Highway',img)
cv2.waitKey(0)
cv2.destroyAllWindows() #close the image window