## Data Wrangling with Pandas

In [1]:
import pandas as pd
import os
from pathlib import Path
data_dir = os.path.join(Path(os.getcwd()).parents[2], "data", "ntbk_data", "04_data")

In [2]:
df = pd.read_csv(os.path.join(data_dir, 'media.csv'))
df

Unnamed: 0.1,Unnamed: 0,art_id,art_content,art_comment,art_date_grabbed
0,0,76,The Democratic Unionist party cannot be allowe...,3,2018-09-23 23:33:02
1,1,77,"Chas Hodges, of the musical duo Chas and Dave...",3,2018-09-23 23:33:23
2,2,78,T he official theme of this year’s UN general ...,3,2018-09-23 23:33:27
3,3,79,Dawn Butler has spoken approvingly about Labou...,3,2018-09-23 23:33:30
4,4,80,Drug producers are capitalising on the rise of...,9,2018-09-23 23:33:37
...,...,...,...,...,...
235,235,1302,Theresa May will come under intense pressure f...,3,2018-09-23 23:51:59
236,236,1303,Senior allies of Jeremy Corbyn questioned the ...,3,2018-09-23 23:52:06
237,237,1304,A man in his 80s has been arrested on suspicio...,3,2018-09-23 23:52:09
238,238,1305,"A year ago, the Labour party conference in Bri...",9,2018-09-23 23:52:12


### Dropping elements in a DataFrame

To remove specific rows or columns from a DataFrame, we can use the drop method:

In [3]:
df.drop([0, 1])

Unnamed: 0.1,Unnamed: 0,art_id,art_content,art_comment,art_date_grabbed
2,2,78,T he official theme of this year’s UN general ...,3,2018-09-23 23:33:27
3,3,79,Dawn Butler has spoken approvingly about Labou...,3,2018-09-23 23:33:30
4,4,80,Drug producers are capitalising on the rise of...,9,2018-09-23 23:33:37
5,5,81,"At least 29 people, including children, have b...",3,2018-09-23 23:33:40
6,6,82,Thousands of people marched to Whitehall on Sa...,3,2018-09-23 23:33:43
...,...,...,...,...,...
235,235,1302,Theresa May will come under intense pressure f...,3,2018-09-23 23:51:59
236,236,1303,Senior allies of Jeremy Corbyn questioned the ...,3,2018-09-23 23:52:06
237,237,1304,A man in his 80s has been arrested on suspicio...,3,2018-09-23 23:52:09
238,238,1305,"A year ago, the Labour party conference in Bri...",9,2018-09-23 23:52:12


As you can see, we passed a list of the elements we want to drop, in this case the integer index labels 0 and 1. Because the DataFrame has a default integer index, this removes the first two rows. However, if we display the DataFrame again, the following happens:

In [4]:
df

Unnamed: 0.1,Unnamed: 0,art_id,art_content,art_comment,art_date_grabbed
0,0,76,The Democratic Unionist party cannot be allowe...,3,2018-09-23 23:33:02
1,1,77,"Chas Hodges, of the musical duo Chas and Dave...",3,2018-09-23 23:33:23
2,2,78,T he official theme of this year’s UN general ...,3,2018-09-23 23:33:27
3,3,79,Dawn Butler has spoken approvingly about Labou...,3,2018-09-23 23:33:30
4,4,80,Drug producers are capitalising on the rise of...,9,2018-09-23 23:33:37
...,...,...,...,...,...
235,235,1302,Theresa May will come under intense pressure f...,3,2018-09-23 23:51:59
236,236,1303,Senior allies of Jeremy Corbyn questioned the ...,3,2018-09-23 23:52:06
237,237,1304,A man in his 80s has been arrested on suspicio...,3,2018-09-23 23:52:09
238,238,1305,"A year ago, the Labour party conference in Bri...",9,2018-09-23 23:52:12


The rows are still in place, because drop returns a new DataFrame and leaves the original unchanged. We have two options here. One is to assign the result of the drop operation to a new variable:

In [5]:
df_new = df.drop([0, 1])
df_new

Unnamed: 0.1,Unnamed: 0,art_id,art_content,art_comment,art_date_grabbed
2,2,78,T he official theme of this year’s UN general ...,3,2018-09-23 23:33:27
3,3,79,Dawn Butler has spoken approvingly about Labou...,3,2018-09-23 23:33:30
4,4,80,Drug producers are capitalising on the rise of...,9,2018-09-23 23:33:37
5,5,81,"At least 29 people, including children, have b...",3,2018-09-23 23:33:40
6,6,82,Thousands of people marched to Whitehall on Sa...,3,2018-09-23 23:33:43
...,...,...,...,...,...
235,235,1302,Theresa May will come under intense pressure f...,3,2018-09-23 23:51:59
236,236,1303,Senior allies of Jeremy Corbyn questioned the ...,3,2018-09-23 23:52:06
237,237,1304,A man in his 80s has been arrested on suspicio...,3,2018-09-23 23:52:09
238,238,1305,"A year ago, the Labour party conference in Bri...",9,2018-09-23 23:52:12


That works. The other option is to use the __inplace__ parameter:

In [6]:
df.drop([0, 1], inplace=True)
df

Unnamed: 0.1,Unnamed: 0,art_id,art_content,art_comment,art_date_grabbed
2,2,78,T he official theme of this year’s UN general ...,3,2018-09-23 23:33:27
3,3,79,Dawn Butler has spoken approvingly about Labou...,3,2018-09-23 23:33:30
4,4,80,Drug producers are capitalising on the rise of...,9,2018-09-23 23:33:37
5,5,81,"At least 29 people, including children, have b...",3,2018-09-23 23:33:40
6,6,82,Thousands of people marched to Whitehall on Sa...,3,2018-09-23 23:33:43
...,...,...,...,...,...
235,235,1302,Theresa May will come under intense pressure f...,3,2018-09-23 23:51:59
236,236,1303,Senior allies of Jeremy Corbyn questioned the ...,3,2018-09-23 23:52:06
237,237,1304,A man in his 80s has been arrested on suspicio...,3,2018-09-23 23:52:09
238,238,1305,"A year ago, the Labour party conference in Bri...",9,2018-09-23 23:52:12


This works as well, but note the following:
<div class="alert alert-block alert-warning">
<b>Tip:</b> If you run this cell repeatedly, e.g. while reworking your code, the rows have already been removed from the DataFrame held in memory, so a second run raises a KeyError. Re-run the cell that loads the data first.
</div>
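
The behaviour described in the tip can be reproduced on a small example. The following is a minimal sketch, assuming a made-up DataFrame called toy that is not part of media.csv. It shows that drop returns a new object by default, that inplace=True modifies the original and returns None, and that dropping the same labels a second time raises a KeyError unless errors="ignore" is passed.

```python
import pandas as pd

# Hypothetical toy data, not taken from media.csv
toy = pd.DataFrame({"art_id": [76, 77, 78], "art_comment": [3, 3, 9]})

# drop returns a new DataFrame; the original keeps all three rows
smaller = toy.drop([0, 1])
print(len(toy), len(smaller))

# With inplace=True the original is modified and the call returns None
result = toy.drop([0, 1], inplace=True)
print(result, list(toy.index))

# Running the same statement again fails: labels 0 and 1 no longer exist
try:
    toy.drop([0, 1], inplace=True)
    rerun_failed = False
except KeyError:
    rerun_failed = True
print(rerun_failed)

# errors="ignore" skips labels that are missing, so re-running is harmless
toy.drop([0, 1], inplace=True, errors="ignore")
```

Assigning the result to a new variable avoids this problem altogether, since the original DataFrame is never changed.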

Dropping columns works similarly, except that you have to change the orientation of the elements to be dropped. Use the axis parameter, as in the following example:

In [7]:
df.drop('Unnamed: 0', axis=1)

Unnamed: 0,art_id,art_content,art_comment,art_date_grabbed
2,78,T he official theme of this year’s UN general ...,3,2018-09-23 23:33:27
3,79,Dawn Butler has spoken approvingly about Labou...,3,2018-09-23 23:33:30
4,80,Drug producers are capitalising on the rise of...,9,2018-09-23 23:33:37
5,81,"At least 29 people, including children, have b...",3,2018-09-23 23:33:40
6,82,Thousands of people marched to Whitehall on Sa...,3,2018-09-23 23:33:43
...,...,...,...,...
235,1302,Theresa May will come under intense pressure f...,3,2018-09-23 23:51:59
236,1303,Senior allies of Jeremy Corbyn questioned the ...,3,2018-09-23 23:52:06
237,1304,A man in his 80s has been arrested on suspicio...,3,2018-09-23 23:52:09
238,1305,"A year ago, the Labour party conference in Bri...",9,2018-09-23 23:52:12
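
As an alternative to axis=1, drop also accepts a columns keyword, which some readers find easier to read. The sketch below uses a made-up DataFrame called toy (an assumption for illustration, not the media data) to show that both spellings give the same result and that several columns can be dropped at once.

```python
import pandas as pd

# Hypothetical toy data that mimics a leftover index column
toy = pd.DataFrame(
    {"Unnamed: 0": [0, 1], "art_id": [76, 77], "art_comment": [3, 3]}
)

# These two calls are equivalent
by_axis = toy.drop("Unnamed: 0", axis=1)
by_keyword = toy.drop(columns="Unnamed: 0")
print(by_axis.equals(by_keyword))

# Several columns can be dropped at once by passing a list
only_ids = toy.drop(columns=["Unnamed: 0", "art_comment"])
print(list(only_ids.columns))
```

As with rows, the original DataFrame is unchanged unless the result is assigned or inplace=True is used.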


Rows that do not contain a certain value can be dropped by keeping only the rows that do, using a boolean mask:

In [8]:
df[df.art_comment == 9]

Unnamed: 0.1,Unnamed: 0,art_id,art_content,art_comment,art_date_grabbed
4,4,80,Drug producers are capitalising on the rise of...,9,2018-09-23 23:33:37
10,10,86,,9,2018-09-23 23:34:04
11,11,87,When the southern Italian city of Matera found...,9,2018-09-23 23:34:08
12,12,88,Heidi Stephens Sat 22 Sep 2018 20.36 BST ...,9,2018-09-23 23:34:11
13,13,89,"B orn in West Yorkshire , Jodie Whittaker , 3...",9,2018-09-23 23:34:19
...,...,...,...,...,...
214,214,1227,"Alexandre Lacazette scored a beauty, Pierre-Em...",9,2018-09-23 23:50:44
216,216,1229,The match had been billed as the battle of the...,9,2018-09-23 23:50:49
217,217,1230,This 80-minute performance was a microcosm of ...,9,2018-09-23 23:50:55
231,231,1286,Rockney was the name given to the fusion of vi...,9,2018-09-23 23:51:45
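
The expression inside the brackets is a boolean Series with one entry per row, and indexing with it keeps the rows marked True. The following sketch, again on a made-up DataFrame called toy (an assumption, not the media data), shows the mask itself, its inversion with ~, and the equivalent formulation with drop.

```python
import pandas as pd

# Hypothetical toy data, not taken from media.csv
toy = pd.DataFrame({"art_id": [80, 81, 86], "art_comment": [9, 3, 9]})

# Boolean Series: True where art_comment equals 9
mask = toy["art_comment"] == 9

# Keep the matching rows
kept = toy[mask]

# ~ inverts the mask and keeps the non-matching rows instead
others = toy[~mask]

# The same result as kept, expressed as a drop of the non-matching labels
via_drop = toy.drop(toy[~mask].index)
print(kept.equals(via_drop))
```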


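Rows in which a value is missing can be removed in a similar way. In the output above, row 10 (art_id 86) has no art_content. Selecting it by art_id returns a DataFrame: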

In [9]:
res = df[df.art_id == 86]
type(res)

pandas.core.frame.DataFrame

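To keep only the rows whose art_content is present, filter with notna:

In [10]:
# A comparison such as nan != value is always True and filters nothing,
# so test for missing values with notna() instead
df[df.art_content.notna()]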


Unnamed: 0.1,Unnamed: 0,art_id,art_content,art_comment,art_date_grabbed
2,2,78,T he official theme of this year’s UN general ...,3,2018-09-23 23:33:27
3,3,79,Dawn Butler has spoken approvingly about Labou...,3,2018-09-23 23:33:30
4,4,80,Drug producers are capitalising on the rise of...,9,2018-09-23 23:33:37
5,5,81,"At least 29 people, including children, have b...",3,2018-09-23 23:33:40
6,6,82,Thousands of people marched to Whitehall on Sa...,3,2018-09-23 23:33:43
...,...,...,...,...,...
235,235,1302,Theresa May will come under intense pressure f...,3,2018-09-23 23:51:59
236,236,1303,Senior allies of Jeremy Corbyn questioned the ...,3,2018-09-23 23:52:06
237,237,1304,A man in his 80s has been arrested on suspicio...,3,2018-09-23 23:52:09
238,238,1305,"A year ago, the Labour party conference in Bri...",9,2018-09-23 23:52:12
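
Missing values need a dedicated test because NaN is not equal to anything, including itself, so a comparison with != is true for every value. The sketch below illustrates this on a made-up DataFrame called toy (an assumption for illustration) and shows two equivalent ways to remove rows with a missing art_content: a notna mask and dropna with the subset parameter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing article text
toy = pd.DataFrame(
    {"art_id": [85, 86, 87], "art_content": ["Some text", np.nan, "More text"]}
)

# NaN never compares equal, not even to itself
print(np.nan != np.nan)

# Keep only rows where art_content is present
clean = toy[toy["art_content"].notna()]

# dropna does the same when restricted to that column
also_clean = toy.dropna(subset=["art_content"])
print(list(clean["art_id"]))
```

Without subset, dropna removes every row that has a missing value in any column, which may be more than intended.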
