#### Download the latest Pleiades dataset as a [csv](https://atlantides.org/downloads/pleiades/dumps/pleiades-places-latest.csv.gz), extract the file, and store it as `pleiades-places.csv` in the same directory as the notebook.

#### Imports
We import pandas and change three display options so that more rows and columns are shown when we print a DataFrame.

In [None]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

#### Let's read the Pleiades csv file and create a DataFrame.

In [None]:
df = pd.read_csv('pleiades-places.csv')
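
If you prefer not to extract the archive by hand, `read_csv` can also read a gzip-compressed csv directly. The sketch below does not touch the real Pleiades file: it writes a tiny made-up csv (`sample-places.csv.gz`, a hypothetical name) to a temporary directory and reads it back.

```python
import gzip
import os
import tempfile

import pandas as pd

# A tiny stand-in for the Pleiades dump (made-up rows, not real data).
csv_text = "id,title\n1,Alpha\n2,Beta\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    gz_path = os.path.join(tmp_dir, "sample-places.csv.gz")
    with gzip.open(gz_path, "wt", encoding="utf-8") as handle:
        handle.write(csv_text)

    # compression="infer" is the default: gzip is detected from the ".gz" suffix.
    sample = pd.read_csv(gz_path, compression="infer")

print(sample.shape)
```

With the real download, the same call would take the path of the `.csv.gz` file instead of the temporary one.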

In [None]:
df.head(2)

## iloc vs loc

Run the following cells. Is there any difference between the results of `iloc[0]` and `loc[0]`?

In [None]:
df = pd.read_csv('pleiades-places.csv')
df.head(2)

In [None]:
df.iloc[0]

In [None]:
df.loc[0]

The answer is no. `iloc` looks for the $0^{th}$ row, that is, the row at position 0, while `loc` looks for the row whose index has the _label_ 0. In our example, both happen to be the same row.
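
To see that position and label are really two different things, here is a small sketch on a made-up DataFrame (the name `toy` and its rows are only for illustration). With the automatic index both lookups agree, but once the first row is dropped, position 0 and label 0 no longer coincide.

```python
import pandas as pd

# Made-up rows; pandas assigns the automatic index 0, 1, 2.
toy = pd.DataFrame({"title": ["Alpha", "Beta", "Gamma"]})

# With the automatic index, position 0 and label 0 are the same row.
same_row = toy.iloc[0].equals(toy.loc[0])

# Drop the row labelled 0: the remaining labels are 1 and 2.
trimmed = toy.drop(index=0)
first_label = trimmed.index[0]          # label of the row at position 0
first_title = trimmed.iloc[0]["title"]  # still reachable by position
label_zero_exists = 0 in trimmed.index  # but label 0 is gone

print(same_row, first_label, first_title, label_zero_exists)
```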

Let us read the csv file again, but this time we are going to use one of the existing columns as the index instead of letting Pandas assign its own. The csv file has a column entitled "id" which I will use as the index.

In [None]:
df = pd.read_csv('pleiades-places.csv', index_col='id')
df.head(2)

Run the following two cells again.

In [None]:
df.iloc[0]

As you might have expected, `iloc` finds the $0^{th}$ row. Let us now run `loc[0]`.

In [None]:
df.loc[0]

You are getting an error message (a `KeyError`) because `loc` cannot find a row with the index label 0. Since we replaced the indices with Pleiades ids, there is no 0 among them. Notice what happens if we search for a row that has a Pleiades id as its index.

In [None]:
df.loc[48210385]

As you see, `loc` can find the row whose index is the Pleiades ID we were searching for.
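
The same behaviour can be reproduced without the Pleiades file. The sketch below uses a made-up DataFrame (`places`, with invented ids) whose index comes from an `id` column, and catches the `KeyError` instead of letting the cell fail.

```python
import pandas as pd

# Made-up rows standing in for the Pleiades data; the ids are invented.
places = pd.DataFrame(
    {"id": [101, 205, 309], "title": ["Alpha", "Beta", "Gamma"]}
).set_index("id")

first_by_position = places.iloc[0]  # the row at position 0, whatever its label
row_by_label = places.loc[205]      # the row whose index label is 205

try:
    places.loc[0]                   # no row carries the label 0
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(first_by_position["title"], row_by_label["title"], label_zero_found)
```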

---

## Search in Pandas

There are multiple ways that we can query a pandas DataFrame. Let's start by reading the speech of Hawking.

In [None]:
import string

with open('Hawking-Questioning-the-Universe.txt', 'r') as f:
    text_as_list = f.read().split(' ')

# and remove the punctuation
text_as_list = [i.translate(str.maketrans('', '', string.punctuation)).lower() for i in text_as_list]

Let us create a dictionary `dict_counter` that takes the words of `text_as_list` as keys. The value of each key should be the number of times that word occurs in the list.

Hint: Check the [`count()` method](https://www.w3schools.com/python/ref_list_count.asp).

In [None]:
dict_counter = {}

for i in text_as_list:
    dict_counter[i] = text_as_list.count(i)
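
The loop above calls `count()` once for every word in the list, which can become slow on long texts. As a possible alternative, the standard library's `collections.Counter` counts every word in a single pass. The sketch uses a short made-up list (`sample_words`) rather than the speech.

```python
from collections import Counter

# A short made-up word list, standing in for text_as_list.
sample_words = ["the", "universe", "the", "galaxy", "the"]

# Counter walks through the list once and tallies each word.
sample_counter = dict(Counter(sample_words))

# The loop-based version from the exercise, for comparison.
loop_counter = {}
for word in sample_words:
    loop_counter[word] = sample_words.count(word)

print(sample_counter == loop_counter)
```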

And now we will create a DataFrame from the dictionary `dict_counter`.

In [None]:
df = pd.DataFrame(dict_counter.items())

Before you run the next line, can you predict how many columns the DataFrame has and what its index is?

In [None]:
df.head(5)

It is fair to conclude that having 0 and 1 as column titles is not very helpful. Let us change this. In the DataFrame below, I am making two columns: one with the words and one with their occurrences. I name the columns accordingly.

In [None]:
df = pd.DataFrame({'words': dict_counter.keys(), 'occurances': dict_counter.values()})
df.head(3)

Notice that I did not pass `dict_counter.items()` this time; instead, I passed two columns, the keys and the values.
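
If you would rather keep the `items()` approach, another option is to name the columns with the `columns` parameter of `pd.DataFrame`. The sketch below uses a small made-up dictionary (`sample_counter`) and builds the DataFrame in both ways.

```python
import pandas as pd

# A small made-up dictionary, standing in for dict_counter.
sample_counter = {"the": 3, "galaxy": 2}

# Option 1: pass the (word, count) pairs and name the columns explicitly.
from_items = pd.DataFrame(
    list(sample_counter.items()), columns=["words", "occurances"]
)

# Option 2: pass one column of keys and one column of values.
from_columns = pd.DataFrame(
    {
        "words": list(sample_counter.keys()),
        "occurances": list(sample_counter.values()),
    }
)

print(from_items)
```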

Now, let's search the words for the item "galaxy". What I wrote below is a filter that checks, row by row, whether the word is "galaxy".

In [None]:
df['words'] == "galaxy"

Now I will write `df[]` and put the filter `df['words'] == "galaxy"` inside the square brackets. The result is a new DataFrame that contains only the rows that pass the filter.

In [None]:
df[df['words'] == "galaxy"]
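
It may help to look at the filter and the selection separately. In this sketch on a made-up DataFrame (`toy`), the filter is stored in a variable called `mask`: it is a Series of `True`/`False` values, one per row, and `toy[mask]` keeps only the rows marked `True`.

```python
import pandas as pd

# Made-up counts, only for illustration.
toy = pd.DataFrame(
    {"words": ["the", "galaxy", "universe"], "occurances": [3, 2, 1]}
)

mask = toy["words"] == "galaxy"  # one boolean per row
matches = toy[mask]              # keeps only the rows where mask is True

print(mask.tolist())
print(matches)
```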

That's great! Yet let's say that I am not interested in getting a DataFrame but just the number of occurrences.

In [None]:
df[df['words'] == "galaxy"]['occurances']

Let's put all this inside an `int()` in order to present it as an integer.

In [None]:
int(df[df['words'] == "galaxy"]['occurances'])

Great! The word galaxy is attested two times!
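
Depending on your pandas version, calling `int()` directly on a one-element Series may show a deprecation warning. A way around it, sketched here on a made-up DataFrame (`toy`), is to take the single element out of the Series first, either by position with `.iloc[0]` or with `.item()`.

```python
import pandas as pd

# Made-up counts, only for illustration.
toy = pd.DataFrame(
    {"words": ["the", "galaxy", "universe"], "occurances": [3, 2, 1]}
)

selected = toy[toy["words"] == "galaxy"]["occurances"]  # one-element Series

count_by_position = int(selected.iloc[0])  # take the first (and only) element
count_by_item = int(selected.item())       # fails unless there is exactly one element

print(count_by_position, count_by_item)
```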

Incidentally, notice that the two lines below produce the same result. I simply moved the `['occurances']` after the first `df`:
- `df[df['words'] == "galaxy"]['occurances']`
- `df['occurances'][df['words'] == "galaxy"]`

The former first makes a DataFrame based on a filter and then selects the column `occurances`. The latter first selects the column `occurances` and then applies the filter to it.
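
A quick check on a made-up DataFrame (`toy`) suggests that both orders indeed return the same Series. The sketch also shows a third spelling, `loc[filter, column]`, which does the row filter and the column selection in one step.

```python
import pandas as pd

# Made-up counts, only for illustration.
toy = pd.DataFrame(
    {"words": ["the", "galaxy", "universe"], "occurances": [3, 2, 1]}
)
mask = toy["words"] == "galaxy"

filter_then_column = toy[mask]["occurances"]  # filter rows, then pick the column
column_then_filter = toy["occurances"][mask]  # pick the column, then filter it
with_loc = toy.loc[mask, "occurances"]        # rows and column in one step

print(filter_then_column.equals(column_then_filter))
print(filter_then_column.equals(with_loc))
```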

The line `int(df[df['words'] == "galaxy"]['occurances'])` does the job, but we can query our DataFrame in another way.

### Looking for galaxies with `loc[]`

As we saw above, `loc[]` looks for the row whose index matches the given label. For example, `loc['galaxy']` searches for the index label "galaxy". Yet, before we do that, let us first see what the index of the DataFrame is and how we can change it.

In [None]:
df.head()

The index is a number. Let's change that. The line below makes the column `words` the index of the DataFrame.

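Note: `set_index("words")` raises an error (a `KeyError`) if `words` is already the index, for example when you re-run these cells after the index has been changed permanently. If this is the case, create a new cell and run `df.reset_index(inplace=True)`.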

In [None]:
df.set_index("words")

Yet, display the DataFrame again.

In [None]:
df.head(2)

The index is back to the automatic index of pandas. Actually, it never changed: `set_index` returned a new DataFrame with `words` as the index and left the original `df` untouched. If we want to change the index of `df` itself, we have to pass the parameter `inplace=True`, which modifies the original DataFrame.

In [None]:
df.set_index("words", inplace=True)
df.head(2)

As you see, now no matter how many times you run `df.head(2)` the column "words" remains the index.
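
An alternative to `inplace=True` is to assign the result of `set_index` back to the same variable. The sketch below, on a made-up DataFrame (`toy`), first shows that a plain `set_index` call leaves the original untouched and then keeps the change by reassigning.

```python
import pandas as pd

# Made-up counts, only for illustration.
toy = pd.DataFrame({"words": ["the", "galaxy"], "occurances": [3, 2]})

returned = toy.set_index("words")            # a new DataFrame; toy is untouched
index_unchanged = list(toy.index) == [0, 1]  # toy still has the automatic index

toy = toy.set_index("words")                 # reassigning keeps the change

print(index_unchanged, list(toy.index))
```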

Let's return to the original question: how can we find the occurrences of "galaxy" with `loc`?

In [None]:
df.loc['galaxy']

Let's put that inside an `int()`.

In [None]:
int(df.loc['galaxy'])

Great! Two galaxies!
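
When only one number is needed, `loc` also accepts a row label and a column label together, which returns the value itself rather than a Series. A minimal sketch on a made-up DataFrame (`toy`) indexed by `words`:

```python
import pandas as pd

# Made-up counts, only for illustration; "words" becomes the index.
toy = pd.DataFrame(
    {"words": ["the", "galaxy"], "occurances": [3, 2]}
).set_index("words")

as_series = toy.loc["galaxy"]                # a Series with every column of that row
as_scalar = toy.loc["galaxy", "occurances"]  # a single value

print(int(as_scalar))
```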

Let's reset the index...

In [None]:
df.reset_index(inplace=True)

and let's see which code is more efficient. Is this better?

In [None]:
int(df[df['words'] == "galaxy"]['occurances'])

or is this better?

In [None]:
df.set_index("words", inplace=True)
int(df.loc['galaxy'])

What do you prefer? There is no right or wrong answer here. It's just a matter of preference.

Let's now search for "galaxy" or "galaxies". I think that the `loc` property is not very efficient in this example.

First we have to reset the index.

In [None]:
df.reset_index(inplace=True)

Then let's write clearly our filters.
- `df['words'] == "galaxy"`
- `df['words'] == "galaxies"`

Since the new DataFrame has two filters, we have to put each of them in parentheses and connect them with an OR (`|`). Now let's put it all inside a `df[ ]`.

In [None]:
df[(df['words'] == "galaxy") | (df['words'] == "galaxies")]

At the end of the above code, I will add a `.sum()`.

In [None]:
df[(df['words'] == "galaxy") | (df['words'] == "galaxies")].sum()

As you see, `sum()` added the numbers but concatenated the strings. Since we only need the column `occurances`, we have to add `['occurances']`:

In [None]:
df[(df['words'] == "galaxy") | (df['words'] == "galaxies")]["occurances"].sum()
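
When the list of words grows, chaining filters with `|` gets long. One possible shortcut is the `isin()` method, which checks each word against a list. The sketch below uses a made-up DataFrame (`toy`) and compares the result with the OR version.

```python
import pandas as pd

# Made-up counts, only for illustration.
toy = pd.DataFrame(
    {"words": ["galaxy", "the", "galaxies"], "occurances": [2, 3, 4]}
)

wanted = ["galaxy", "galaxies"]

# isin() returns True for every row whose word appears in the list.
total_isin = toy[toy["words"].isin(wanted)]["occurances"].sum()

# The OR version from above, for comparison.
total_or = toy[(toy["words"] == "galaxy") | (toy["words"] == "galaxies")][
    "occurances"
].sum()

print(total_isin, total_or)
```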

---