# Introduction to Pandas

### Importing

First, we need to import the pandas package:

In [1]:
import pandas as pd
# remember this from earlier? we call this to get plots to show up inline in the notebook
%matplotlib inline

Let's read in a CSV file called `sales.csv` and create a pandas DataFrame:

- Think of a dataframe as similar to a worksheet tab in Excel

In [2]:
df = pd.read_csv('data/sales.csv')

Reminder check: where does `pd.` come from? What are we doing with that piece of code?
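
To explore that question without needing the file on disk, here is a minimal sketch. It assumes a small in-memory stand-in for `data/sales.csv` (the names `csv_text` and `demo` are made up for illustration; the values are the first three rows shown in this notebook).

```python
import io
import pandas as pd

# Hypothetical stand-in for data/sales.csv: three rows of text in CSV format.
csv_text = """TV,Radio,Newspaper,Sales
230.1,37.8,69.2,22.1
44.5,39.3,45.1,10.4
17.2,45.9,69.3,9.3
"""

# pd is the alias we gave the pandas module when we imported it;
# read_csv is a function that lives inside that module.
demo = pd.read_csv(io.StringIO(csv_text))

print(type(demo))   # the class of the object read_csv gives back
print(demo.shape)   # (number of rows, number of columns)
```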

The next thing we probably want to do is peek at the data:

In [3]:
df.head(10)

,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9
5,8.7,48.9,75.0,7.2
6,57.5,32.8,23.5,11.8
7,120.2,19.6,11.6,13.2
8,8.6,2.1,1.0,4.8
9,199.8,2.6,21.2,10.6
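
As a rough sketch of how the number passed to `.head()` controls what we see, the example below uses a small made-up frame called `demo` (a stand-in, not the real sales file). `.tail()` is the matching method for the end of the data.

```python
import pandas as pd

# Hypothetical stand-in: two columns from the first five rows of the sales data.
demo = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9],
})

first_two = demo.head(2)   # the first 2 rows
last_two = demo.tail(2)    # the last 2 rows
default = demo.head()      # no argument: up to 5 rows

print(first_two)
print(last_two)
```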


Why might we want to peek at our data?

### Getting information on our data

Before we start doing any analysis, we want to eyeball our dataset for any early issues:

In [4]:
df.describe()

,TV,Radio,Newspaper,Sales
count,200.0,200.0,200.0,200.0
mean,147.0425,23.264,30.554,14.0225
std,85.854236,14.846809,21.778621,5.217457
min,0.7,0.0,0.3,1.6
25%,74.375,9.975,12.75,10.375
50%,149.75,22.9,25.75,12.9
75%,218.825,36.525,45.1,17.4
max,296.4,49.6,114.0,27.0


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
TV           200 non-null float64
Radio        200 non-null float64
Newspaper    200 non-null float64
Sales        200 non-null float64
dtypes: float64(4)
memory usage: 6.3 KB
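
One kind of early issue is missing data. The sketch below assumes a tiny made-up frame (`demo`) with one deliberately missing value, since the real sales data shows 200 non-null entries in every column. It shows a few checks that complement `.describe()` and `.info()`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with one missing TV value (np.nan) planted on purpose.
demo = pd.DataFrame({
    "TV": [230.1, 44.5, np.nan, 151.5],
    "Radio": [37.8, 39.3, 45.9, 41.3],
})

# Count the missing values in each column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Check the type pandas inferred for each column.
print(demo.dtypes)

# Summary statistics skip missing values by default.
tv_mean = demo["TV"].mean()
tv_count = demo.describe().loc["count", "TV"]
print(tv_mean, tv_count)
```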


### Selecting Data

- We may want to select particular columns or rows
- Hint: pandas dataframes work like dictionaries!
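
Here is a small sketch of what the dictionary hint means in practice, using a made-up two-column frame called `demo`: column names play the role of keys, and each key looks up one column.

```python
import pandas as pd

# Hypothetical stand-in frame built from a plain dictionary of lists.
demo = pd.DataFrame({"TV": [230.1, 44.5], "Sales": [22.1, 10.4]})

print(list(demo.keys()))   # column names act like dictionary keys
print("TV" in demo)        # membership tests check the column names
print("Price" in demo)     # no such column in this frame

tv = demo["TV"]            # square brackets look a column up by its key
print(type(tv))            # a single column comes back as a pandas Series
```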

In [5]:
# we can call columns by name (as a string)
df['TV']

0      230.1
1       44.5
2       17.2
3      151.5
4      180.8
5        8.7
6       57.5
7      120.2
8        8.6
9      199.8
10      66.1
11     214.7
12      23.8
13      97.5
14     204.1
15     195.4
16      67.8
17     281.4
18      69.2
19     147.3
20     218.4
21     237.4
22      13.2
23     228.3
24      62.3
25     262.9
26     142.9
27     240.1
28     248.8
29      70.6
       ...  
170     50.0
171    164.5
172     19.6
173    168.4
174    222.4
175    276.9
176    248.4
177    170.2
178    276.7
179    165.6
180    156.6
181    218.5
182     56.2
183    287.6
184    253.8
185    205.0
186    139.5
187    191.1
188    286.0
189     18.7
190     39.5
191     75.5
192     17.2
193    166.8
194    149.7
195     38.2
196     94.2
197    177.0
198    283.6
199    232.1
Name: TV, dtype: float64

The numbers above in the left column are called the "index". Other software packages sometimes call this the row number or row name. 

- In Excel, rows have numbers and columns have letters by default.

In [6]:
df.head()

,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [7]:
# use .iloc to call by position (row#, col#)
df.iloc[0,0]

230.09999999999999

In [8]:
# use .loc to call by name (rowname, colname)
df.loc[0, 'TV']

230.09999999999999
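
With the default index, `.iloc` and `.loc` can look interchangeable because row positions and row labels are the same numbers. The sketch below uses made-up frames (`labeled` and `default`, with hypothetical row labels) to show where they differ.

```python
import pandas as pd

# Hypothetical frame whose row labels are letters instead of 0, 1, 2.
labeled = pd.DataFrame(
    {"TV": [230.1, 44.5, 17.2], "Sales": [22.1, 10.4, 9.3]},
    index=["a", "b", "c"],
)

by_position = labeled.iloc[1, 0]    # second row, first column
by_label = labeled.loc["b", "TV"]   # row labeled "b", column named "TV"
print(by_position, by_label)        # both point at the same cell

# Slices behave differently: .iloc excludes the end, .loc includes it.
default = pd.DataFrame({"TV": [230.1, 44.5, 17.2]})
rows_by_position = default.iloc[0:2]   # positions 0 and 1
rows_by_label = default.loc[0:2]       # labels 0, 1 and 2
print(len(rows_by_position), len(rows_by_label))
```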

What if we wanted to call multiple columns?

In [9]:
df[['TV', 'Sales', 'Newspaper']].head()

,TV,Sales,Newspaper
0,230.1,22.1,69.2
1,44.5,10.4,45.1
2,17.2,9.3,69.3
3,151.5,18.5,58.5
4,180.8,12.9,58.4
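
Note the double square brackets: the outer pair does the selecting and the inner pair is an ordinary Python list of column names. A quick sketch with a made-up frame (`demo`) shows how the result changes with the kind of key we pass in.

```python
import pandas as pd

# Hypothetical stand-in frame with three columns.
demo = pd.DataFrame({
    "TV": [230.1, 44.5],
    "Radio": [37.8, 39.3],
    "Sales": [22.1, 10.4],
})

one_column = demo["TV"]           # a string key returns a Series
one_column_frame = demo[["TV"]]   # a list with one name returns a DataFrame
subset = demo[["Sales", "TV"]]    # columns come back in the order requested

print(type(one_column))
print(type(one_column_frame))
print(list(subset.columns))
```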


What if we wanted to get all the column names?

In [10]:
df.columns

Index([u'TV', u'Radio', u'Newspaper', u'Sales'], dtype='object')

What data structure does the first argument inside `Index(...)` above look like? Can we change it?

In [11]:
df.columns = ['tv', 'radio', 'news', 'sales']

In [12]:
df.head()

,tv,radio,news,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9
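
Assigning a whole list to `df.columns` means every name has to be listed, in the right order. As an alternative sketch (on a made-up one-row frame called `demo`), `.rename()` changes only the columns we name and returns a new DataFrame rather than modifying the original.

```python
import pandas as pd

# Hypothetical stand-in frame with the original column names.
demo = pd.DataFrame({
    "TV": [230.1],
    "Radio": [37.8],
    "Newspaper": [69.2],
    "Sales": [22.1],
})

# Rename a single column with an {old_name: new_name} dictionary.
renamed = demo.rename(columns={"Newspaper": "news"})

# Or pass a function that is applied to every column name.
lowered = demo.rename(columns=str.lower)

print(list(renamed.columns))
print(list(lowered.columns))
print(list(demo.columns))   # the original frame is unchanged
```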


In [13]:
df.sample(10)

,tv,radio,news,sales
175,276.9,48.9,41.8,27.0
63,102.7,29.6,8.4,14.0
75,16.9,43.7,89.4,8.7
98,289.7,42.3,51.2,25.4
137,273.7,28.9,59.7,20.8
81,239.8,4.1,36.9,12.3
65,69.0,9.3,0.9,9.3
0,230.1,37.8,69.2,22.1
189,18.7,12.1,23.4,6.7
87,110.7,40.6,63.2,16.0
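
Because `.sample()` picks rows at random, rerunning the cell gives different rows each time. If we want a repeatable draw, one option is the `random_state` argument, sketched here on a made-up five-row frame (`demo`); the seed value 0 is arbitrary.

```python
import pandas as pd

# Hypothetical stand-in frame with five rows.
demo = pd.DataFrame({
    "tv": [230.1, 44.5, 17.2, 151.5, 180.8],
    "sales": [22.1, 10.4, 9.3, 18.5, 12.9],
})

# The same seed gives the same rows every time.
first_draw = demo.sample(3, random_state=0)
second_draw = demo.sample(3, random_state=0)
print(first_draw.equals(second_draw))

# frac asks for a fraction of the rows instead of a fixed count.
two_fifths = demo.sample(frac=0.4, random_state=0)
print(len(two_fifths))
```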


What if we wanted to add new columns? Remember, dataframes work like dictionaries!

Example: let's add a new column called `total_spend`:

In [14]:
df['total_spend'] = df['tv'] + df['radio'] + df['news']

In [15]:
df.head()

,tv,radio,news,sales,total_spend
0,230.1,37.8,69.2,22.1,337.1
1,44.5,39.3,45.1,10.4,128.9
2,17.2,45.9,69.3,9.3,132.4
3,151.5,41.3,58.5,18.5,251.3
4,180.8,10.8,58.4,12.9,250.0
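
The same total can also be computed with a row-wise sum, which saves typing when there are many columns to add up. This is only a sketch on a made-up two-row frame (`demo`); `.assign()` is used so the original frame is left untouched.

```python
import pandas as pd

# Hypothetical stand-in: the first two rows of the spend columns.
demo = pd.DataFrame({
    "tv": [230.1, 44.5],
    "radio": [37.8, 39.3],
    "news": [69.2, 45.1],
})

# axis=1 sums across the columns of each row.
row_totals = demo[["tv", "radio", "news"]].sum(axis=1)

# assign returns a new DataFrame with the extra column added.
with_total = demo.assign(total_spend=row_totals)

print(with_total)
print(list(demo.columns))   # demo itself still has three columns
```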


Now let's delete the total_spend column:

In [16]:
# axis=1 means drop from columns
df.drop('total_spend', axis=1, inplace=True)

In [17]:
df.head()

,tv,radio,news,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9
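
Without `inplace=True`, `.drop()` leaves the original DataFrame alone and returns a new one. A short sketch on a made-up frame (`demo`) follows; it assumes a pandas version recent enough to accept the `columns=` keyword, which is an alternative spelling of `axis=1`.

```python
import pandas as pd

# Hypothetical stand-in frame that still has the total_spend column.
demo = pd.DataFrame({
    "tv": [230.1, 44.5],
    "sales": [22.1, 10.4],
    "total_spend": [337.1, 128.9],
})

# Both calls return a new frame without the column.
trimmed = demo.drop("total_spend", axis=1)
also_trimmed = demo.drop(columns=["total_spend"])

print(list(trimmed.columns))
print(list(demo.columns))   # the original still has total_spend
```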


We can save our work to a CSV file with `.to_csv()`:

In [None]:
df.to_csv('sales_new.csv')
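
By default, `.to_csv()` writes the index as an extra first column with no name, and that column shows up as `Unnamed: 0` when the file is read back in. The sketch below, on a made-up frame (`demo`), keeps everything in memory instead of writing files: calling `.to_csv()` with no path returns the CSV text as a string.

```python
import io
import pandas as pd

# Hypothetical stand-in frame.
demo = pd.DataFrame({"tv": [230.1, 44.5], "sales": [22.1, 10.4]})

with_index = demo.to_csv()                 # index written as the first column
without_index = demo.to_csv(index=False)   # index left out

# Read both versions back in to compare the columns we get.
reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with.columns))      # ['Unnamed: 0', 'tv', 'sales']
print(list(reloaded_without.columns))   # ['tv', 'sales']
```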

# Group Exercise

- In a group of 4-5, create a pandas DataFrame with 4 columns: name, height(inches), gender, shoe_size using your classmates' data
- Use .describe() to get the mean height and shoe_size
- Save the newly created dataframe to a csv called `test.csv`
- Add a column called hair_color with your group's hair color
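
As a starting point, here is one possible way to build a DataFrame by hand from a dictionary of lists. The names, values, and the column name `height_inches` are all made up for illustration, so swap in your group's real data and columns.

```python
import pandas as pd

# Made-up placeholder data: one list per column, one entry per person.
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy"],
    "height_inches": [64, 70, 68],
    "shoe_size": [7.5, 10.0, 9.0],
})

# describe() summarizes the numeric columns only.
summary = people.describe()
mean_height = summary.loc["mean", "height_inches"]
mean_shoe_size = summary.loc["mean", "shoe_size"]

print(summary)
print(mean_height, mean_shoe_size)
```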