#### Download the latest Pleiades dataset as a [csv](https://atlantides.org/downloads/pleiades/dumps/pleiades-places-latest.csv.gz), extract the file, and store the CSV in the same directory as the notebook under the name `pleiades-places.csv`, which is the filename the cells below read.

#### Imports
We import pandas and change three display options so that up to 500 rows and 500 columns are shown, with a display width of 1000 characters.

In [2]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

#### Let's read the Pleiades csv file and create a DataFrame.

In [3]:
df = pd.read_csv('pleiades-places.csv')

In [4]:
df.head(2)

Unnamed: 0,authors,bbox,connectsWith,created,creators,currentVersion,description,extent,featureTypes,geoContext,hasConnectionsWith,id,locationPrecision,maxDate,minDate,modified,path,reprLat,reprLatLong,reprLong,tags,timePeriods,timePeriodsKeys,timePeriodsRange,title,uid
0,"Becker, J., T. Elliott","13.4119837, 42.082885, 13.4119837, 42.082885",413005,2016-11-04T16:36:09Z,"jbecker, thomase",1.0,The post-Roman settlement at Alba Fucens becam...,"{""type"": ""Point"", ""coordinates"": [13.4119837, ...",settlement,,,48210385,precise,1453.0,640.0,2016-11-08T21:58:28Z,/places/48210385,42.082885,"42.082885,13.4119837",13.411984,,M,mediaeval-byzantine,"640.0,1453.0",Borgo Medievale,ece5760c4c6d42c1a331aad543c4ecc4
1,"Becker, J., T. Elliott","11.6285463, 42.4193742, 11.6285463, 42.4193742",413393,2016-11-04T16:39:09Z,"jbecker, thomase",2.0,A major urban sanctuary at Vulci with a long p...,"{""type"": ""Point"", ""coordinates"": [11.6285463, ...",temple-2,,,48210386,precise,300.0,-750.0,2016-12-05T11:47:10Z,/places/48210386,42.419374,"42.4193742,11.6285463",11.628546,"sanctuary, extant remains, temple",ACHR,"archaic,classical,hellenistic-republican,roman","-750.0,300.0",Tempio Grande at Vulci,4e06898f2de74dbc9f3a3bdba6d74ba2


## iloc vs loc

Run the following two cells. Is there any difference in the result?

In [5]:
df = pd.read_csv('pleiades-places.csv')
df.head(2)

Unnamed: 0,authors,bbox,connectsWith,created,creators,currentVersion,description,extent,featureTypes,geoContext,hasConnectionsWith,id,locationPrecision,maxDate,minDate,modified,path,reprLat,reprLatLong,reprLong,tags,timePeriods,timePeriodsKeys,timePeriodsRange,title,uid
0,"Becker, J., T. Elliott","13.4119837, 42.082885, 13.4119837, 42.082885",413005,2016-11-04T16:36:09Z,"jbecker, thomase",1.0,The post-Roman settlement at Alba Fucens becam...,"{""type"": ""Point"", ""coordinates"": [13.4119837, ...",settlement,,,48210385,precise,1453.0,640.0,2016-11-08T21:58:28Z,/places/48210385,42.082885,"42.082885,13.4119837",13.411984,,M,mediaeval-byzantine,"640.0,1453.0",Borgo Medievale,ece5760c4c6d42c1a331aad543c4ecc4
1,"Becker, J., T. Elliott","11.6285463, 42.4193742, 11.6285463, 42.4193742",413393,2016-11-04T16:39:09Z,"jbecker, thomase",2.0,A major urban sanctuary at Vulci with a long p...,"{""type"": ""Point"", ""coordinates"": [11.6285463, ...",temple-2,,,48210386,precise,300.0,-750.0,2016-12-05T11:47:10Z,/places/48210386,42.419374,"42.4193742,11.6285463",11.628546,"sanctuary, extant remains, temple",ACHR,"archaic,classical,hellenistic-republican,roman","-750.0,300.0",Tempio Grande at Vulci,4e06898f2de74dbc9f3a3bdba6d74ba2


In [6]:
df.iloc[0]

authors                                          Becker, J., T. Elliott
bbox                       13.4119837, 42.082885, 13.4119837, 42.082885
connectsWith                                                     413005
created                                            2016-11-04T16:36:09Z
creators                                               jbecker, thomase
currentVersion                                                      1.0
description           The post-Roman settlement at Alba Fucens becam...
extent                {"type": "Point", "coordinates": [13.4119837, ...
featureTypes                                                 settlement
geoContext                                                          NaN
hasConnectionsWith                                                  NaN
id                                                             48210385
locationPrecision                                               precise
maxDate                                                         

In [7]:
df.loc[0]

authors                                          Becker, J., T. Elliott
bbox                       13.4119837, 42.082885, 13.4119837, 42.082885
connectsWith                                                     413005
created                                            2016-11-04T16:36:09Z
creators                                               jbecker, thomase
currentVersion                                                      1.0
description           The post-Roman settlement at Alba Fucens becam...
extent                {"type": "Point", "coordinates": [13.4119837, ...
featureTypes                                                 settlement
geoContext                                                          NaN
hasConnectionsWith                                                  NaN
id                                                             48210385
locationPrecision                                               precise
maxDate                                                         

The answer is no, but only by coincidence. `iloc` looks for the row at _position_ 0 (the $0^{th}$ row), while `loc` looks for the row whose index has the _label_ 0. In our example, pandas assigned the default index 0, 1, 2, and so on, so both point to the same row.

Let us read the CSV file again, but this time we will use one of the existing columns as the index instead of letting pandas assign its own. The CSV file has a column named "id", which I will use as the index.

In [8]:
df = pd.read_csv('pleiades-places.csv', index_col='id')
df.head(2)

id,authors,bbox,connectsWith,created,creators,currentVersion,description,extent,featureTypes,geoContext,hasConnectionsWith,locationPrecision,maxDate,minDate,modified,path,reprLat,reprLatLong,reprLong,tags,timePeriods,timePeriodsKeys,timePeriodsRange,title,uid
48210385,"Becker, J., T. Elliott","13.4119837, 42.082885, 13.4119837, 42.082885",413005,2016-11-04T16:36:09Z,"jbecker, thomase",1.0,The post-Roman settlement at Alba Fucens becam...,"{""type"": ""Point"", ""coordinates"": [13.4119837, ...",settlement,,,precise,1453.0,640.0,2016-11-08T21:58:28Z,/places/48210385,42.082885,"42.082885,13.4119837",13.411984,,M,mediaeval-byzantine,"640.0,1453.0",Borgo Medievale,ece5760c4c6d42c1a331aad543c4ecc4
48210386,"Becker, J., T. Elliott","11.6285463, 42.4193742, 11.6285463, 42.4193742",413393,2016-11-04T16:39:09Z,"jbecker, thomase",2.0,A major urban sanctuary at Vulci with a long p...,"{""type"": ""Point"", ""coordinates"": [11.6285463, ...",temple-2,,,precise,300.0,-750.0,2016-12-05T11:47:10Z,/places/48210386,42.419374,"42.4193742,11.6285463",11.628546,"sanctuary, extant remains, temple",ACHR,"archaic,classical,hellenistic-republican,roman","-750.0,300.0",Tempio Grande at Vulci,4e06898f2de74dbc9f3a3bdba6d74ba2


Run the following two cells again.

In [11]:
df.iloc[0]

authors                                          Becker, J., T. Elliott
bbox                       13.4119837, 42.082885, 13.4119837, 42.082885
connectsWith                                                     413005
created                                            2016-11-04T16:36:09Z
creators                                               jbecker, thomase
currentVersion                                                      1.0
description           The post-Roman settlement at Alba Fucens becam...
extent                {"type": "Point", "coordinates": [13.4119837, ...
featureTypes                                                 settlement
geoContext                                                          NaN
hasConnectionsWith                                                  NaN
locationPrecision                                               precise
maxDate                                                          1453.0
minDate                                                         

As you might have expected, `iloc` finds the $0^{th}$ row. Let us now run `loc[0]`.

In [12]:
df.loc[0]

KeyError: 0

You get an error message because `loc` cannot find a row with the index label 0. Since we replaced the default index with the Pleiades ids, there is no 0 among the index labels. Notice what happens if we search for a row by its Pleiades id.

In [13]:
df.loc[48210385]

authors                                          Becker, J., T. Elliott
bbox                       13.4119837, 42.082885, 13.4119837, 42.082885
connectsWith                                                     413005
created                                            2016-11-04T16:36:09Z
creators                                               jbecker, thomase
currentVersion                                                      1.0
description           The post-Roman settlement at Alba Fucens becam...
extent                {"type": "Point", "coordinates": [13.4119837, ...
featureTypes                                                 settlement
geoContext                                                          NaN
hasConnectionsWith                                                  NaN
locationPrecision                                               precise
maxDate                                                          1453.0
minDate                                                         

As you see, `loc` finds the row whose index label is the Pleiades id we were searching for.
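The difference between position and label can also be reproduced without the CSV file. The sketch below builds a hypothetical two-row DataFrame (the name `places` and its single `title` column are chosen for illustration) that uses the two Pleiades ids shown above as index labels.

```python
import pandas as pd

# Toy frame: the two Pleiades ids from the output above serve as index labels
places = pd.DataFrame(
    {"title": ["Borgo Medievale", "Tempio Grande at Vulci"]},
    index=pd.Index([48210385, 48210386], name="id"),
)

first_by_position = places.iloc[0]     # row at position 0, whatever its label is
first_by_label = places.loc[48210385]  # row whose index label is 48210385

try:
    places.loc[0]                      # no row carries the label 0
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position["title"])
print(first_by_label["title"])
print(label_zero_found)
```

Both lookups return the same row here, while the label 0 is reported as missing.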

---

## Search in Pandas

There are multiple ways that we can query a pandas DataFrame. Let's start by reading the speech of Hawking.

In [14]:
import string

# Read the speech and split it into words on single spaces
with open('Hawking-Questioning-the-Universe.txt', 'r') as f:
    text_as_list = f.read().split(' ')

# Remove the punctuation and lowercase every word
text_as_list = [i.translate(str.maketrans('', '', string.punctuation)).lower() for i in text_as_list]

Let us create a dictionary `dict_counter` that takes the words of `text_as_list` as keys. The value of each key should be the number of times that word is attested in the list.

Hint: Check the [`count()` method](https://www.w3schools.com/python/ref_list_count.asp).

In [15]:
dict_counter = {}

for i in text_as_list:
    dict_counter[i] = text_as_list.count(i)
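The loop above scans the whole list once for every word. As a possible alternative, `collections.Counter` from the standard library counts everything in a single pass. The sketch below uses a short hypothetical list, `sample_words`, in place of `text_as_list`.

```python
from collections import Counter

# Hypothetical sample standing in for text_as_list
sample_words = ["the", "universe", "is", "the", "universe", "the"]

# Approach from the cell above: one full scan of the list per word
dict_counter = {}
for word in sample_words:
    dict_counter[word] = sample_words.count(word)

# Single-pass alternative
counter = Counter(sample_words)

print(dict(counter) == dict_counter)
```

For a short speech the difference is negligible, so the `count()` version remains a fine choice for this exercise.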

Now we will create a DataFrame from the dictionary `dict_counter`.

In [20]:
df = pd.DataFrame(dict_counter.items())

Before you run the next line, can you predict how many columns the DataFrame has and what its index is?

In [21]:
df.head(5)

Unnamed: 0,0,1
0,there,10
1,is,13
2,nothing,2
3,bigger,1
4,or,5


It is fair to conclude that having 0 and 1 as column titles is not very helpful. Let us change this. In the DataFrame below, I am making two columns, one with the words and one with their occurrences, and I name the columns accordingly (the occurrences column is spelled `occurances` in the code).

In [22]:
df = pd.DataFrame({'words': dict_counter.keys(), 'occurances': dict_counter.values()})
df.head(3)

Unnamed: 0,words,occurances
0,there,10
1,is,13
2,nothing,2


Notice that we did not write `dict_counter.items()`. Instead, I passed two separate columns, the keys and the values.

Apropos, the above line of code has the same result as `df = pd.DataFrame(dict_counter.items(), columns=["words", "occurances"])`.
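The equivalence of the two constructions can be checked on a small dictionary, provided both use the same column names. This is a minimal sketch: the three entries are the first rows shown above, and the names `from_columns` and `from_items` are chosen for illustration.

```python
import pandas as pd

# First three entries of the output above
dict_counter = {"there": 10, "is": 13, "nothing": 2}

# One column built from the keys, one from the values
from_columns = pd.DataFrame(
    {"words": list(dict_counter.keys()), "occurances": list(dict_counter.values())}
)

# (key, value) pairs, with the column names given explicitly
from_items = pd.DataFrame(list(dict_counter.items()), columns=["words", "occurances"])

print(from_columns.equals(from_items))
```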

Now, let's search the words for the item "galaxy". What I wrote below is a filter that checks, row by row, whether the word is "galaxy".

In [23]:
df['words'] == "galaxy"

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
30     False
31     False
32     False
33     False
34     False
35     False
36     False
37     False
38     False
39     False
40     False
41     False
42     False
43     False
44     False
45     False
46     False
47     False
48     False
49     False
50     False
51     False
52     False
53     False
54     False
55     False
56     False
57     False
58     False
59     False
60     False
61     False
62     False
63     False
64     False
65     False
66     False
67     False
68     False
69     False
70     False
71     False
72     False
73     False
74     False
75     False
76     False

Now I will create a new DataFrame with `df[]`, and inside the square brackets I am going to put the filter `df['words'] == "galaxy"`. Only the rows for which the filter is `True` are kept.

In [26]:
df[df['words'] == "galaxy"]

Unnamed: 0,words,occurances
197,galaxy,2

That's great! Yet let's say that I am not interested in getting a DataFrame but just the occurrences.

In [27]:
df[df['words'] == "galaxy"]['occurances']

197    2
Name: occurances, dtype: int64

Let's put all this inside an `int()` in order to present it as an integer.

In [28]:
int(df[df['words'] == "galaxy"]['occurances'])

2

Great! The word galaxy is attested two times!
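Depending on the pandas version, calling `int()` directly on a one-element Series may emit a deprecation warning. A minimal sketch of two alternatives follows, assuming a hypothetical two-row frame named `counts` that reuses the index labels 53 and 197 from the outputs in this notebook.

```python
import pandas as pd

# Hypothetical two-row stand-in for the word-count DataFrame
counts = pd.DataFrame(
    {"words": ["galaxies", "galaxy"], "occurances": [1, 2]},
    index=[53, 197],
)

matches = counts[counts["words"] == "galaxy"]["occurances"]

via_iloc = int(matches.iloc[0])  # take the first match explicitly
via_item = matches.item()        # raises ValueError unless exactly one row matched

print(via_iloc, via_item)
```

`item()` is the stricter of the two: it refuses to return a value if the filter matched zero rows or more than one row.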

Apropos, notice that the two lines below produce the same result. I simply moved the `['occurances']` to directly after the first `df`:
- `df[df['words'] == "galaxy"]['occurances']`
- `df['occurances'][df['words'] == "galaxy"]`

The former first builds a DataFrame based on the filter and then selects the column `occurances`. The latter first selects the column `occurances` and then applies the filter to it.
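Both variants above select in two steps. As a sketch of a third option, `loc` accepts the filter and the column name together and selects in one step. The frame `counts` below is a hypothetical stand-in for the word-count DataFrame.

```python
import pandas as pd

# Hypothetical two-row stand-in for the word-count DataFrame
counts = pd.DataFrame(
    {"words": ["galaxies", "galaxy"], "occurances": [1, 2]},
    index=[53, 197],
)

is_galaxy = counts["words"] == "galaxy"

filter_then_column = counts[is_galaxy]["occurances"]  # DataFrame first, then column
column_then_filter = counts["occurances"][is_galaxy]  # column first, then filter
single_step = counts.loc[is_galaxy, "occurances"]     # rows and column at once

print(single_step)
```

All three return the same one-element Series with the index label 197.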

This line of code, `int(df[df['words'] == "galaxy"]['occurances'])`, gets the job done, but we can query our DataFrame in another way.

### Looking for galaxies with `loc[]`

As we saw above, `loc[]` looks up rows by index label. For example, `loc['galaxy']` searches for the index label "galaxy". Before we do that, let us first see what the index of the DataFrame is and how we can change it.

In [29]:
df.head()

Unnamed: 0,words,occurances
0,there,10
1,is,13
2,nothing,2
3,bigger,1
4,or,5


The index is a number. Let's change that. The line below makes the column `words` the index of the DataFrame.

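Note: once `words` has become the index permanently (with `inplace=True`, as we do further below), running `set_index("words")` again raises a `KeyError`, because `words` is no longer a column. If this happens, create a new cell and run `df.reset_index(inplace=True)`.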

In [30]:
df.set_index("words")

words,occurances
there,10
is,13
nothing,2
bigger,1
or,5
older,1
than,2
the,77
universe,19
questions,4


Yet, display the DataFrame again.

In [31]:
df.head(2)

Unnamed: 0,words,occurances
0,there,10
1,is,13


The index is back to the automatic index of pandas. More precisely, it never changed: `set_index` returned a new DataFrame with `words` as the index and left `df` itself untouched. If we want to change the index permanently, we have to pass the parameter `inplace=True`, which changes the original DataFrame. Why?

"Whenever you call a method on a DataFrame in the form `df.method_name()`, you will get back a copy of the DataFrame with
that method applied, leaving the original DataFrame untouched. We have just done that by calling df.reset_index(). If you wanted to change the original DataFrame, you would have to assign the return value back to the original variable like the following: 
`df = df.reset_index()`

Since we are not doing this, it means that our variable df is still holding its original data." Zumstein, F. (2021) _Python for Excel. A Modern Environment for Automation and Data Analysis_, p. 89.

In [32]:
df.set_index("words", inplace=True)
df.head(2)

words,occurances
there,10
is,13


As you see, now no matter how many times you run `df.head(2)` the column "words" remains the index.
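The alternative mentioned in the quotation, assigning the return value back to the variable, can be sketched as follows. The two-row frame `counts` is a hypothetical stand-in built from the first two rows shown above.

```python
import pandas as pd

# First two rows of the word-count DataFrame
counts = pd.DataFrame({"words": ["there", "is"], "occurances": [10, 13]})

# set_index returns a new DataFrame; counts itself is left untouched
returned = counts.set_index("words")
still_has_column = "words" in counts.columns

# Assigning the result back makes the change stick without inplace=True
counts = counts.set_index("words")

print(still_has_column)
print(counts.index.name)
```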

Let's return to the original question: how can we find the occurrences of "galaxy" with `loc`?

In [35]:
df.loc['galaxy']

occurances    2
Name: galaxy, dtype: int64

Let's put that inside an `int()`.

In [36]:
int(df.loc['galaxy'])

2

Great! Two galaxies!
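`loc['galaxy']` returns a whole row as a Series, which is why an `int()` is needed around it. A minimal sketch of going straight to the single cell is shown below. It assumes a hypothetical frame `indexed` whose index labels are unique, and uses the `at` accessor, which looks up one value by row label and column label.

```python
import pandas as pd

# Hypothetical stand-in, already indexed by word
indexed = pd.DataFrame(
    {"words": ["galaxies", "galaxy"], "occurances": [1, 2]}
).set_index("words")

row = indexed.loc["galaxy"]                     # Series holding the whole row
by_loc = indexed.loc["galaxy", "occurances"]    # single value: row label and column label
by_at = indexed.at["galaxy", "occurances"]      # same value through the scalar accessor

print(by_loc, by_at)
```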

Let's reset the index...

In [37]:
df.reset_index(inplace=True)

Now let's see which code is more efficient. Is this better?

In [39]:
int(df[df['words'] == "galaxy"]['occurances'])

2

or is this better?

In [40]:
df.set_index("words", inplace=True)
int(df.loc['galaxy'])

2

What do you prefer? There is no right or wrong answer here. It's just a matter of preference.
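If "efficient" is read as running time rather than readability, the two approaches can be timed with `timeit` from the standard library. The sketch below runs on a synthetic frame of 1001 made-up words (an assumption, not the Hawking text), and the timings it prints depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

# Synthetic data: 1000 made-up words plus "galaxy"
counts = pd.DataFrame(
    {
        "words": [f"w{i}" for i in range(1000)] + ["galaxy"],
        "occurances": [1] * 1000 + [2],
    }
)
indexed = counts.set_index("words")


def by_mask():
    # Compare every word with "galaxy", then pick the matching value
    return counts.loc[counts["words"] == "galaxy", "occurances"].iloc[0]


def by_label():
    # Look the label up in the index
    return indexed.loc["galaxy", "occurances"]


mask_seconds = timeit.timeit(by_mask, number=200)
label_seconds = timeit.timeit(by_label, number=200)

print(f"boolean filter: {mask_seconds:.4f} s for 200 lookups")
print(f"label lookup:   {label_seconds:.4f} s for 200 lookups")
```

The measurement leaves out the cost of `set_index` itself, which matters if the index has to be rebuilt before every lookup.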

Let's now search for "galaxy" or "galaxies". I think that the `loc` property is not very efficient in this example.

First we have to reset the index.

In [41]:
df.reset_index(inplace=True)

Then let's write clearly our filters.
- `df['words'] == "galaxy"`
- `df['words'] == "galaxies"`

Since the new DataFrame combines two filters, we have to put each of them in parentheses and connect them with an OR (`|`). Now let's put the result inside a `df[ ]`.

In [42]:
df[(df['words'] == "galaxy") | (df['words'] == "galaxies")]

Unnamed: 0,words,occurances
53,galaxies,1
197,galaxy,2


At the end of the above code, I will add a `.sum()`.

In [43]:
df[(df['words'] == "galaxy") | (df['words'] == "galaxies")].sum()

words         galaxiesgalaxy
occurances                 3
dtype: object

As you see, `sum()` added the numbers but concatenated the strings. Since we only need the column `occurances`, we have to add `['occurances']`:

In [44]:
df[(df['words'] == "galaxy") | (df['words'] == "galaxies")]["occurances"].sum()

3
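When the list of words grows, chaining one `|` per word becomes hard to read. A possible alternative is the Series method `isin`, which tests each word against a list. The sketch below uses a hypothetical three-row frame `counts` with the counts reported above.

```python
import pandas as pd

# Hypothetical three-row stand-in for the word-count DataFrame
counts = pd.DataFrame(
    {"words": ["universe", "galaxies", "galaxy"], "occurances": [19, 1, 2]}
)

# Two filters joined with OR, as in the cell above
with_or = counts[(counts["words"] == "galaxy") | (counts["words"] == "galaxies")]["occurances"].sum()

# The same selection with isin and a list of words
with_isin = counts.loc[counts["words"].isin(["galaxy", "galaxies"]), "occurances"].sum()

print(with_or, with_isin)
```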

---