### Series in pandas

A Series is a one-dimensional array in which every value is paired with an index label.

In [1]:
import pandas as pd

In [2]:
#create a Series from a list
l = [0,1,2,3,4,5]
ser = pd.Series(l)
ser

0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

In [3]:
#some additional info
print(type(ser))
print(ser.shape)

<class 'pandas.core.series.Series'>
(6,)


In [4]:
rangeSeries = pd.Series(range(-3, 4))
print(rangeSeries)

0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64


In [5]:
# use the strings in `names` as index labels instead of the default 0..5
names = ['NYT', 'WP', 'LAT', 'CNN', 'BBC', 'TG']
ser = pd.Series(l, index = names)
ser

NYT    0
WP     1
LAT    2
CNN    3
BBC    4
TG     5
dtype: int64

In [6]:
#looking up data becomes easier now
print(ser['LAT'])

2


In [7]:
a = ser > 3
print(a)

NYT    False
WP     False
LAT    False
CNN    False
BBC     True
TG      True
dtype: bool


In [8]:
# the boolean Series `a` acts as a mask: only entries where it is True are kept
print(ser[a])

BBC    4
TG     5
dtype: int64
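
The two cells above can be read as one pattern: a comparison produces a boolean Series, and indexing with that Series keeps only the matching entries. The following is a minimal sketch of that pattern; the name `ser_demo` is introduced here for illustration so that the block runs on its own.

```python
import pandas as pd

# Rebuild the labelled Series from the cells above under a new name
ser_demo = pd.Series([0, 1, 2, 3, 4, 5],
                     index=['NYT', 'WP', 'LAT', 'CNN', 'BBC', 'TG'])

# The comparison is applied element-wise and yields a boolean Series
mask = ser_demo > 3

# Indexing with the mask keeps only the entries where the mask is True
selected = ser_demo[mask]
print(selected.index.tolist())   # ['BBC', 'TG']

# The two steps are usually written as a single expression
print(ser_demo[ser_demo > 3].tolist())   # [4, 5]
```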


### From dictionary to Series

A dictionary can be passed directly to __pd.Series__. Note that the keys of the dictionary become the index of the Series:

In [9]:
# simple dictionary
data = {'NYT':0, 'WP':1, 'LAT':2, 'CNN':3, 'BBC':4, 'TG':5}

# Convert the dictionary into a pd.Series, and view it
media = pd.Series(data)
media

NYT    0
WP     1
LAT    2
CNN    3
BBC    4
TG     5
dtype: int64

In [10]:
media.index = ['New York Times', 'Washington Post', 'Los Angeles Times', 'Cable News Network', 'British Broadcasting Company', 'The Guardian']

In [11]:
media

New York Times                  0
Washington Post                 1
Los Angeles Times               2
Cable News Network              3
British Broadcasting Company    4
The Guardian                    5
dtype: int64

### Dataframes in pandas

DataFrames can be treated like tables, and there are a number of ways to create them:


In [12]:
#if the lists are of equal length
data = {'medium': ['NYT', 'WP', 'LAT', 'CNN', 'BBC', 'TG'], 
        'articles': [2000, 8000, 3000, 500, 12000, 1000], 
        'reporters': [25, 76, 30, 10, 100, 15]}  # numbers are fictional
df = pd.DataFrame(data)
df

Unnamed: 0,medium,articles,reporters
0,NYT,2000,25
1,WP,8000,76
2,LAT,3000,30
3,CNN,500,10
4,BBC,12000,100
5,TG,1000,15
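
A dictionary of equal-length lists is only one of the possible inputs. As a small sketch of another common one, the block below builds the same kind of table from a list of row dictionaries; the names `rows`, `df_rows` and `df_cols` are made up for this illustration.

```python
import pandas as pd

# One dictionary per row; the keys become the column names
rows = [
    {'medium': 'NYT', 'articles': 2000, 'reporters': 25},
    {'medium': 'WP', 'articles': 8000, 'reporters': 76},
    {'medium': 'LAT', 'articles': 3000, 'reporters': 30},
]
df_rows = pd.DataFrame(rows)

# The same data as a dictionary of equal-length lists (one list per column)
df_cols = pd.DataFrame({
    'medium': ['NYT', 'WP', 'LAT'],
    'articles': [2000, 8000, 3000],
    'reporters': [25, 76, 30],
})

print(df_rows.columns.tolist())   # ['medium', 'articles', 'reporters']
print(df_rows.shape)              # (3, 3)
```

Which form is more convenient usually depends on how the raw data arrives: record by record, or column by column.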


When creating a DataFrame from a dictionary, the column order can be set explicitly with the __columns__ argument:

In [13]:
df_new = pd.DataFrame(data, columns=['reporters', 'articles', 'medium'])
df_new

Unnamed: 0,reporters,articles,medium
0,25,2000,NYT
1,76,8000,WP
2,30,3000,LAT
3,10,500,CNN
4,100,12000,BBC
5,15,1000,TG


In [14]:
#adding a new column
df_new['long name'] = ['New York Times', 'Washington Post', 'Los Angeles Times', 'Cable News Network', 'British Broadcasting Company', 'The Guardian']

In [15]:
#deleting a column
del df_new['medium']
df_new

Unnamed: 0,reporters,articles,long name
0,25,2000,New York Times
1,76,8000,Washington Post
2,30,3000,Los Angeles Times
3,10,500,Cable News Network
4,100,12000,British Broadcasting Company
5,15,1000,The Guardian


In [16]:
#transpose
df_new.T

Unnamed: 0,0,1,2,3,4,5
reporters,25,76,30,10,100,15
articles,2000,8000,3000,500,12000,1000
long name,New York Times,Washington Post,Los Angeles Times,Cable News Network,British Broadcasting Company,The Guardian


In [17]:
df_new.T.T

Unnamed: 0,reporters,articles,long name
0,25,2000,New York Times
1,76,8000,Washington Post
2,30,3000,Los Angeles Times
3,10,500,Cable News Network
4,100,12000,British Broadcasting Company
5,15,1000,The Guardian


### Indexing in Pandas

The __iloc__ indexer selects parts of a DataFrame by integer position. What it returns depends on whether we pass a slice or individual positions.

In [18]:
df_new

Unnamed: 0,reporters,articles,long name
0,25,2000,New York Times
1,76,8000,Washington Post
2,30,3000,Los Angeles Times
3,10,500,Cable News Network
4,100,12000,British Broadcasting Company
5,15,1000,The Guardian


In [19]:
res = df_new.iloc[-2:]
res

Unnamed: 0,reporters,articles,long name
4,100,12000,British Broadcasting Company
5,15,1000,The Guardian


In [20]:
res = df_new.iloc[0,2]
res

'New York Times'

As we can see, passing the slice __[-2:]__ extracts the last two rows (in general, __[0:n]__ extracts the first n rows, i.e. positions 0 to n-1), while __[0,2]__ accesses the third element in the first row.

<div class="alert alert-block alert-warning">
<b>Note:</b> The two results have different data types: in the former case we retrieve a DataFrame, while in the latter case we get a single value (here a string)!
</div>
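
To make the difference in return types concrete, here is a short sketch on a hypothetical three-row frame (`df_demo` is not part of the notebook's data). It also shows the intermediate case of a single row position, which gives a Series.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'reporters': [25, 76, 30],
    'articles': [2000, 8000, 3000],
    'long name': ['New York Times', 'Washington Post', 'Los Angeles Times'],
})

last_two = df_demo.iloc[-2:]     # slice of row positions -> DataFrame
one_row = df_demo.iloc[0]        # single row position -> Series
one_cell = df_demo.iloc[0, 2]    # row and column position -> single value

print(type(last_two).__name__)   # DataFrame
print(type(one_row).__name__)    # Series
print(one_cell)                  # New York Times
```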

We can also use this to change values in the DataFrame directly:

In [21]:
df_new.iloc[0,2] = 'The New York Times'
df_new

Unnamed: 0,reporters,articles,long name
0,25,2000,The New York Times
1,76,8000,Washington Post
2,30,3000,Los Angeles Times
3,10,500,Cable News Network
4,100,12000,British Broadcasting Company
5,15,1000,The Guardian


In [22]:
#df_new.loc['2000']

### Extracting Series from Pandas DataFrames

Using __iloc__ we are able to extract Series from DataFrames. The following two cells demonstrate this:

In [23]:
ser = df_new.iloc[:,2]
ser

0              The New York Times
1                 Washington Post
2               Los Angeles Times
3              Cable News Network
4    British Broadcasting Company
5                    The Guardian
Name: long name, dtype: object

In [24]:
ser2 = df_new.iloc[1:4,2]
ser2

1       Washington Post
2     Los Angeles Times
3    Cable News Network
Name: long name, dtype: object

We can extract a row from a DataFrame using the __loc__ indexer and passing a label, which needs to match a label in the index of the DataFrame (here the default integer index):

In [25]:
res_1 = df_new.loc[1]
res_1

reporters                 76
articles                8000
long name    Washington Post
Name: 1, dtype: object

The result is a Series whose index consists of the column names, so individual values can be looked up by column name, e.g.:

In [26]:
res_1['reporters']

76

Passing a list of labels works too; however, this results in a DataFrame:

In [27]:
res_2 = df_new.loc[[1,2]]
res_2

Unnamed: 0,reporters,articles,long name
1,76,8000,Washington Post
2,30,3000,Los Angeles Times


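We can also pass slices to __loc__. Unlike __iloc__, which excludes the end position, __loc__ includes the end label. Compare the results of the following two slices: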

In [28]:
resdf = df_new.iloc[1:4]

In [29]:
resdf

Unnamed: 0,reporters,articles,long name
1,76,8000,Washington Post
2,30,3000,Los Angeles Times
3,10,500,Cable News Network


In [30]:
res_3 = df_new.loc[1:4]
res_3

Unnamed: 0,reporters,articles,long name
1,76,8000,Washington Post
2,30,3000,Los Angeles Times
3,10,500,Cable News Network
4,100,12000,British Broadcasting Company


Label slices also work for the columns:

In [31]:
res_4 = df_new.loc[1:4, 'reporters':'articles']
res_4

Unnamed: 0,reporters,articles
1,76,8000
2,30,3000
3,10,500
4,100,12000
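
The different treatment of the slice end in __iloc__ and __loc__ is easy to overlook when the index labels happen to be the integers 0, 1, 2, and so on. The sketch below uses a hypothetical frame `df_demo` with string labels to make the two behaviours easier to tell apart.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'reporters': [25, 76, 30, 10, 100, 15],
     'articles': [2000, 8000, 3000, 500, 12000, 1000]},
    index=['NYT', 'WP', 'LAT', 'CNN', 'BBC', 'TG'],
)

# Position-based: the end position (4) is excluded
by_position = df_demo.iloc[1:4]
print(by_position.index.tolist())   # ['WP', 'LAT', 'CNN']

# Label-based: the end label ('BBC') is included
by_label = df_demo.loc['WP':'BBC']
print(by_label.index.tolist())      # ['WP', 'LAT', 'CNN', 'BBC']

# Column slices with loc include both end labels as well
print(df_demo.loc['WP':'BBC', 'reporters':'articles'].shape)   # (4, 2)
```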


Passing a list with a single label still results in a DataFrame (not a Series):

In [32]:
res_5 = df_new.loc[[1]]
res_5

Unnamed: 0,reporters,articles,long name
1,76,8000,Washington Post


Sometimes we want to use one of the columns as the index of our DataFrame. This can be done with __set_index__:

In [33]:
res_3.set_index('long name', inplace=True)
res_3.loc['Washington Post':]

Unnamed: 0_level_0,reporters,articles
long name,Unnamed: 1_level_1,Unnamed: 2_level_1
Washington Post,76,8000
Los Angeles Times,30,3000
Cable News Network,10,500
British Broadcasting Company,100,12000


In [34]:
res_3

Unnamed: 0_level_0,reporters,articles
long name,Unnamed: 1_level_1,Unnamed: 2_level_1
Washington Post,76,8000
Los Angeles Times,30,3000
Cable News Network,10,500
British Broadcasting Company,100,12000
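
As a hedged alternative to __inplace=True__, the sketch below assigns the result of __set_index__ to a new variable, which leaves the original frame untouched. The frame `df_demo` and the names `by_name` and `restored` are made up for this illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'reporters': [76, 30, 10],
    'articles': [8000, 3000, 500],
    'long name': ['Washington Post', 'Los Angeles Times', 'Cable News Network'],
})

# Without inplace=True, set_index returns a new DataFrame
by_name = df_demo.set_index('long name')

# Rows can now be looked up and sliced by name
print(by_name.loc['Los Angeles Times', 'articles'])       # 3000
print(by_name.loc['Los Angeles Times':].index.tolist())
# ['Los Angeles Times', 'Cable News Network']

# reset_index moves the labels back into an ordinary column
restored = by_name.reset_index()
print(restored.columns.tolist())   # ['long name', 'reporters', 'articles']
```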


### Logical indexing in DataFrames

With the help of [logical indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) we can retrieve data from our DataFrames more easily. For example, we can pass an array or Series of True/False values to the __.loc__ indexer to select the rows where the value is __True__.

In [35]:
#let's create some larger artificial data for the sake of simplicity
datalarge = {'medium': ['NYT', 'WP', 'LAT', 'CNN', 'BBC', 'TG','NYT', 'WP', 'LAT', 'CNN', 'BBC', 'TG'], 
        'articles': [2000, 8000, 3000, 500, 12000, 1000, 3000, 8000, 5000, 500, 1000, 1000], 
        'reporters': [25, 76, 30, 10, 100, 15, 26, 71, 10, 10, 101, 15],
        'sections': [13, 25, 23, 22, 12, 3, 6, 10, 23 ,19, 18, 6], 
        'articles_per_week': [200, 250, 190, 222, 120, 300, 506, 116, 213 ,119, 418, 61], 
        }  # numbers are fictional
dflarge = pd.DataFrame(datalarge)
dflarge

Unnamed: 0,medium,articles,reporters,sections,articles_per_week
0,NYT,2000,25,13,200
1,WP,8000,76,25,250
2,LAT,3000,30,23,190
3,CNN,500,10,22,222
4,BBC,12000,100,12,120
5,TG,1000,15,3,300
6,NYT,3000,26,6,506
7,WP,8000,71,10,116
8,LAT,5000,10,23,213
9,CNN,500,10,19,119
10,BBC,1000,101,18,418
11,TG,1000,15,6,61


In [36]:
dflarge.loc[dflarge['medium'] == 'NYT']

Unnamed: 0,medium,articles,reporters,sections,articles_per_week
0,NYT,2000,25,13,200
6,NYT,3000,26,6,506


In [37]:
tmp = (dflarge['medium'] == 'NYT')
print(type(tmp))
print(tmp)

<class 'pandas.core.series.Series'>
0      True
1     False
2     False
3     False
4     False
5     False
6      True
7     False
8     False
9     False
10    False
11    False
Name: medium, dtype: bool


A few things are worth noting in this example:
- the __dflarge['medium'] == 'NYT'__ statement returns a Pandas Series of True/False values
- we can use this Series as input for the __loc__ statement to retrieve all the rows where our statement matches the desired result
- in our case we were looking for all rows containing NYT as the medium
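
Because such conditions are ordinary boolean Series, they can be stored in variables and combined element-wise. The following is a minimal sketch on a hypothetical five-row frame (`df_demo`, `is_nyt` and `many_articles` are names chosen for this example). Note that the operators are `&`, `|` and `~` rather than the Python keywords `and`, `or` and `not`.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'medium': ['NYT', 'WP', 'LAT', 'NYT', 'WP'],
    'articles': [2000, 8000, 3000, 3000, 8000],
})

is_nyt = df_demo['medium'] == 'NYT'
many_articles = df_demo['articles'] > 2000

# & is element-wise AND, | is element-wise OR, ~ is element-wise NOT
print(df_demo.loc[is_nyt & many_articles].index.tolist())   # [3]
print(df_demo.loc[is_nyt | many_articles].index.tolist())   # [0, 1, 2, 3, 4]
print(df_demo.loc[~is_nyt].index.tolist())                  # [1, 2, 4]
```

When the comparisons are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>`.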

In [38]:
# this also works with the greater-than / less-than operators
dflarge.loc[dflarge['articles'] > 2000]

Unnamed: 0,medium,articles,reporters,sections,articles_per_week
1,WP,8000,76,25,250
2,LAT,3000,30,23,190
4,BBC,12000,100,12,120
6,NYT,3000,26,6,506
7,WP,8000,71,10,116
8,LAT,5000,10,23,213


In [39]:
# again, to select only specific columns, pass them as the second argument
dflarge.loc[dflarge['articles'] > 2000, ['medium','articles']]

Unnamed: 0,medium,articles
1,WP,8000
2,LAT,3000
4,BBC,12000
6,NYT,3000
7,WP,8000
8,LAT,5000


In [40]:
df_new

Unnamed: 0,reporters,articles,long name
0,25,2000,The New York Times
1,76,8000,Washington Post
2,30,3000,Los Angeles Times
3,10,500,Cable News Network
4,100,12000,British Broadcasting Company
5,15,1000,The Guardian


In [42]:
# Setting a value in a column using the .loc indexer.
# df_new still has its default integer index, so the row is selected with a boolean mask
# on 'long name'; a row label that is not in the index would append a new row instead.
df_new.loc[df_new['long name'] == 'Washington Post', 'reporters'] = 80
df_new

Unnamed: 0,reporters,articles,long name
0,25,2000,The New York Times
1,80,8000,Washington Post
2,30,3000,Los Angeles Times
3,10,500,Cable News Network
4,100,12000,British Broadcasting Company
5,15,1000,The Guardian
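
When assigning through __.loc__, the row selector is matched against the index labels, not against the values of a column. The sketch below contrasts the two cases on a hypothetical two-row frame (`df_demo` and `df_enlarged` are illustrative names): selecting an existing row through a boolean mask, and assigning to a label that does not exist in the index.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'reporters': [25, 76],
    'long name': ['The New York Times', 'Washington Post'],
})

# Select the row through a boolean mask on the column, then assign
df_demo.loc[df_demo['long name'] == 'Washington Post', 'reporters'] = 80
print(df_demo['reporters'].tolist())   # [25, 80]
print(len(df_demo))                    # 2

# Assigning to a row label that is not in the index appends a new row instead
df_enlarged = df_demo.copy()
df_enlarged.loc['Washington Post', 'reporters'] = 80
print(len(df_enlarged))                # 3
print(df_enlarged.index.tolist())      # [0, 1, 'Washington Post']
```

The appended row has no values for the other columns, so they are filled with missing values.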


### Tasks
Have a look at the comments in the following cells. Each comment describes a task: solve it in a new cell of its own. Use the data from this DataFrame and make use of the __.loc__ indexer in each task:

In [41]:
dflarge

Unnamed: 0,medium,articles,reporters,sections,articles_per_week
0,NYT,2000,25,13,200
1,WP,8000,76,25,250
2,LAT,3000,30,23,190
3,CNN,500,10,22,222
4,BBC,12000,100,12,120
5,TG,1000,15,3,300
6,NYT,3000,26,6,506
7,WP,8000,71,10,116
8,LAT,5000,10,23,213
9,CNN,500,10,19,119
10,BBC,1000,101,18,418
11,TG,1000,15,6,61


In [43]:
# Task 1: Select rows with medium LAT and all columns between 'articles' and 'sections'
dflarge.loc[dflarge.medium == "LAT", "articles":"sections"]

Unnamed: 0,articles,reporters,sections
2,3000,30,23
8,5000,10,23


In [52]:
# Task 1: Select rows with medium LAT and all columns between 'articles' and 'sections'
dflarge.loc[dflarge['medium'] == 'LAT', 'articles':'sections']

Unnamed: 0,articles,reporters,sections
2,3000,30,23
8,5000,10,23


In [51]:
# Task 2: Select rows where the medium column ends with 'T'
dflarge.loc[dflarge.medium.str.endswith('T')]
# or
dflarge.loc[dflarge.medium.str[-1] == 'T']

Unnamed: 0,medium,articles,reporters,sections,articles_per_week
0,NYT,2000,25,13,200
2,LAT,3000,30,23,190
6,NYT,3000,26,6,506
8,LAT,5000,10,23,213


In [52]:
# Task 2: Select rows where the medium column ends with 'T'
dflarge.loc[dflarge['medium'].str.endswith('T')]

Unnamed: 0,medium,articles,reporters,sections,articles_per_week
0,NYT,2000,25,13,200
2,LAT,3000,30,23,190
6,NYT,3000,26,6,506
8,LAT,5000,10,23,213


In [53]:
# Task 3: Select rows with medium equal to the values in this list: ['BBC', 'CNN', 'WP']
dflarge.loc[dflarge.medium.isin(['BBC', 'CNN', 'WP'])]

Unnamed: 0,medium,articles,reporters,sections,articles_per_week
1,WP,8000,76,25,250
3,CNN,500,10,22,222
4,BBC,12000,100,12,120
7,WP,8000,71,10,116
9,CNN,500,10,19,119
10,BBC,1000,101,18,418


In [54]:
# Task 3: Select rows with medium equal to the values in this list: ['BBC', 'CNN', 'WP']
dflarge.loc[dflarge['medium'].isin(['BBC', 'CNN', 'WP'])] 

Unnamed: 0,medium,articles,reporters,sections,articles_per_week
1,WP,8000,76,25,250
3,CNN,500,10,22,222
4,BBC,12000,100,12,120
7,WP,8000,71,10,116
9,CNN,500,10,19,119
10,BBC,1000,101,18,418


In [57]:
# Task 4: Select rows with medium WP and 3000 articles or more
dflarge.loc[(dflarge.medium == 'WP') & (dflarge.articles >= 3000)]

Unnamed: 0,medium,articles,reporters,sections,articles_per_week
1,WP,8000,76,25,250
7,WP,8000,71,10,116


In [57]:
# Task 4: Select rows with medium WP and 3000 articles or more
dflarge.loc[(dflarge['medium'] == 'WP') & (dflarge['articles'] >= 3000)]

Unnamed: 0,medium,articles,reporters,sections,articles_per_week
1,WP,8000,76,25,250
7,WP,8000,71,10,116

In [63]:
# Task 5: Select rows with index labels 2 through 5, and just return the 'reporters' and 'sections' columns
dflarge.loc[2:5, ['reporters','sections']]

Unnamed: 0,reporters,sections
2,30,23
3,10,22
4,100,12
5,15,3


In [58]:
# Task 5: Select rows with index labels 2 through 5, and just return the 'reporters' and 'sections' columns
dflarge.loc[2:5, ['reporters', 'sections']] 

Unnamed: 0,reporters,sections
2,30,23
3,10,22
4,100,12
5,15,3


In [66]:
# Task 6: Select rows where the length of the medium name is 3 letters; make use of a lambda function
dflarge.loc[dflarge.medium.apply(lambda x: len(x) == 3)]



Unnamed: 0,medium,articles,reporters,sections,articles_per_week
0,NYT,2000,25,13,200
2,LAT,3000,30,23,190
3,CNN,500,10,22,222
4,BBC,12000,100,12,120
6,NYT,3000,26,6,506
8,LAT,5000,10,23,213
9,CNN,500,10,19,119
10,BBC,1000,101,18,418


In [59]:
# Task 6: A lambda function that yields True/False values can also be used.
# Select rows where the length of the medium name is 3 letters; make use of a lambda function
dflarge.loc[dflarge['medium'].apply(lambda x: len(x) == 3)]

Unnamed: 0,medium,articles,reporters,sections,articles_per_week
0,NYT,2000,25,13,200
2,LAT,3000,30,23,190
3,CNN,500,10,22,222
4,BBC,12000,100,12,120
6,NYT,3000,26,6,506
8,LAT,5000,10,23,213
9,CNN,500,10,19,119
10,BBC,1000,101,18,418


In [68]:
# Form a separate variable 'idx' with your selections from the lambda function from the cell above
idx = dflarge.medium.apply(lambda x: len(x) == 3)
# Select only the True values in 'idx' and only the 3 columns 'reporters', 'sections', 'articles_per_week':
dflarge.loc[idx, ['reporters', 'sections', 'articles_per_week']]





Unnamed: 0,reporters,sections,articles_per_week
0,25,13,200
2,30,23,190
3,10,22,222
4,100,12,120
6,26,6,506
8,10,23,213
9,10,19,119
10,101,18,418
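
The task asks for a lambda function, but the same boolean index can also be built without one. As a small sketch, the block below compares the lambda version with the vectorized `.str.len()` accessor on a hypothetical four-row frame (`df_demo`, `idx_lambda` and `idx_str` are names chosen for this example).

```python
import pandas as pd

df_demo = pd.DataFrame({
    'medium': ['NYT', 'WP', 'LAT', 'TG'],
    'reporters': [25, 76, 30, 15],
})

# Boolean index built with a lambda, as in the task
idx_lambda = df_demo['medium'].apply(lambda x: len(x) == 3)

# The same index built with the string accessor
idx_str = df_demo['medium'].str.len() == 3

print(idx_lambda.tolist() == idx_str.tolist())    # True
print(df_demo.loc[idx_str, 'medium'].tolist())    # ['NYT', 'LAT']
```

Either index can be passed to __.loc__ together with a list of columns, exactly as in the cell above.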


In [48]:
# Form a separate variable 'idx' with your selections from the lambda function from the cell above
idx = _#dflarge['medium'].apply(lambda x: len(x) == 3)

# Select only the True values in 'idx' and only the 3 columns 'reporters', 'sections', 'articles_per_week':
