# Looking at DataFrame Data

1. Run the cell below to import the required libraries and create a DataFrame.

In [3]:
import pandas as pd
import numpy as np
import random

num_rows = 100
colors = ['Red', 'Blue', 'Green']

df = pd.DataFrame( {'color': [colors[random.randint(0,2)] for _ in range(num_rows)],
                    'integers': [random.randint(0,15) for _ in range(num_rows)],
                    'floats': [random.random() for _ in range(num_rows)]})
df

,color,integers,floats
0,Blue,15,0.799567
1,Blue,9,0.911170
2,Red,12,0.596825
3,Red,10,0.075444
4,Green,2,0.208097
...,...,...,...
95,Red,9,0.125208
96,Blue,15,0.386884
97,Red,7,0.829760
98,Red,7,0.875479
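
Because the cell above draws from `random`, the values change on every run and will not match the output shown here. A minimal sketch of one way to make the data repeatable, assuming that is wanted: seed the generator before building the DataFrame. The `make_df` helper and the seed value 42 are illustrative and not part of the original exercise, and `random.choice` stands in for `colors[random.randint(0,2)]` as a shorter way to pick a color.

```python
import random

import pandas as pd


def make_df(seed, num_rows=100):
    """Build the same three-column DataFrame, seeded so the values repeat."""
    random.seed(seed)  # a fixed seed makes every run produce the same values
    colors = ['Red', 'Blue', 'Green']
    return pd.DataFrame({
        'color': [random.choice(colors) for _ in range(num_rows)],
        'integers': [random.randint(0, 15) for _ in range(num_rows)],
        'floats': [random.random() for _ in range(num_rows)],
    })


df_a = make_df(seed=42)
df_b = make_df(seed=42)
print(df_a.equals(df_b))  # identical seeds give identical frames
print(df_a.shape)
```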


2. Use the DataFrame `head()` method to view the top five rows. Try giving it a number as an argument to control how many rows are displayed.

In [4]:
df.head()

,color,integers,floats
0,Blue,15,0.799567
1,Blue,9,0.91117
2,Red,12,0.596825
3,Red,10,0.075444
4,Green,2,0.208097
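
As a small sketch of the row-count argument, the example below calls `head()` with and without a number, and uses `tail()` for the last rows. The `demo` DataFrame is a hypothetical stand-in, kept small so the results are easy to check by eye.

```python
import pandas as pd

# Small hypothetical DataFrame, not the one from the exercise
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green', 'Red'],
                     'integers': [3, 1, 4, 1, 5, 9, 2]})

first_five = demo.head()   # no argument: the first 5 rows
first_two = demo.head(2)   # the first 2 rows
last_two = demo.tail(2)    # tail() mirrors head(): the last 2 rows

print(len(first_five), len(first_two), len(last_two))
```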


3. View summary statistics using the DataFrame `describe()` method.

In [5]:
df.describe()

,integers,floats
count,100.0,100.0
mean,7.74,0.522942
std,4.652837,0.276242
min,0.0,0.021771
25%,3.75,0.312648
50%,8.0,0.518277
75%,12.0,0.757754
max,15.0,0.997878


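4. The `describe()` method accepts some optional arguments, including `include` and `exclude`. By default, `describe()` only shows statistics for columns with numerical data, but if you add the argument `include=object`, it will display statistics for columns with string data instead. (Older material writes this as `include=np.object`; that alias was deprecated in NumPy 1.20 and removed in NumPy 1.24, and the built-in `object` is its equivalent.) Try this.

In [6]:
df.describe(include=object)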

,color
count,100
unique,3
top,Green
freq,41
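
The `exclude` argument works in the opposite direction. As a rough sketch, excluding the numeric types leaves only the non-numeric columns, which gives the same kind of summary for a text column. The `demo` DataFrame is a hypothetical stand-in for the exercise data.

```python
import numpy as np
import pandas as pd

# Small hypothetical DataFrame, not the one from the exercise
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Red', 'Green', 'Red'],
                     'integers': [1, 2, 3, 4, 5]})

# Drop every numeric column from the summary; only 'color' remains
text_stats = demo.describe(exclude=[np.number])
print(text_stats)

# Each statistic can be read with .loc[row_label, column_label]
print(text_stats.loc['top', 'color'], text_stats.loc['freq', 'color'])
```

For a text column, `count` is the number of non-missing values, `unique` the number of distinct values, `top` the most frequent value, and `freq` how often `top` occurs.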


5. If you change the argument to `include='all'`, it will display statistics for all columns in the DataFrame, inserting `NaN` (not a number) where a statistic does not apply to a column's data type. Try viewing statistics for all columns using `describe()`.

In [7]:
df.describe(include='all')

,color,integers,floats
count,100,100.0,100.0
unique,3,,
top,Green,,
freq,41,,
mean,,7.74,0.522942
std,,4.652837,0.276242
min,,0.0,0.021771
25%,,3.75,0.312648
50%,,8.0,0.518277
75%,,12.0,0.757754
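
To see where the `NaN` entries come from, this sketch checks individual cells of an `include='all'` summary. The `demo` DataFrame is again a hypothetical stand-in, chosen so the mean is easy to verify.

```python
import pandas as pd

# Small hypothetical DataFrame, not the one from the exercise
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Red'],
                     'integers': [1, 2, 3]})

all_stats = demo.describe(include='all')

# 'mean' is undefined for text, so the 'color' cell is NaN
mean_of_color = all_stats.loc['mean', 'color']
# 'top' is only computed for non-numeric data, so the 'integers' cell is NaN
top_of_integers = all_stats.loc['top', 'integers']

print(pd.isna(mean_of_color), pd.isna(top_of_integers))
print(all_stats.loc['mean', 'integers'])  # mean of 1, 2, 3
```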


## Selecting Data

6. You can select a column using bracket syntax very similar to that used with dictionaries. Put the column name, as a string, in brackets after the DataFrame name. Try this with the column 'color'.

In [8]:
df['color']

0      Blue
1      Blue
2       Red
3       Red
4     Green
      ...  
95      Red
96     Blue
97      Red
98      Red
99      Red
Name: color, Length: 100, dtype: object

7. Try selecting the columns 'color' and 'floats' by supplying them as a list of strings in the same bracket syntax.

In [9]:
df[['color', 'floats']]

,color,floats
0,Blue,0.799567
1,Blue,0.911170
2,Red,0.596825
3,Red,0.075444
4,Green,0.208097
...,...,...
95,Red,0.125208
96,Blue,0.386884
97,Red,0.829760
98,Red,0.875479
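
The two bracket forms return different types, which is easy to miss. In this sketch, a single string gives a Series, while a list of strings gives a DataFrame, even when the list holds only one name. The `demo` DataFrame is a hypothetical stand-in.

```python
import pandas as pd

# Small hypothetical DataFrame, not the one from the exercise
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Green'],
                     'integers': [7, 8, 9],
                     'floats': [0.1, 0.2, 0.3]})

one_column = demo['color']               # string -> Series
one_column_frame = demo[['color']]       # one-item list -> DataFrame
two_columns = demo[['color', 'floats']]  # list -> DataFrame, in the order listed

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
print(two_columns.shape)
```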


8. The bracket syntax in DataFrames is overloaded to select rows as well. Selecting rows uses the same syntax we used to slice sequences: a start position, a colon, and an upper bound that is not included. Try selecting three rows from the DataFrame using the slice `10:13`.

In [20]:
df[10:13]

,color,integers,floats
10,Red,14,0.07292
11,Blue,14,0.816948
12,Blue,11,0.63853
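
As a quick check of the slice rule, the sketch below slices a hypothetical 20-row DataFrame with plain brackets and confirms that the upper bound is left out.

```python
import pandas as pd

# Hypothetical DataFrame whose values equal their row positions
demo = pd.DataFrame({'integers': list(range(20))})

rows = demo[10:13]  # a bare slice in brackets selects rows, not columns

print(len(rows))
print(list(rows.index))  # positions 10, 11 and 12; 13 is excluded
```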


9. Now let's try the `.loc[]` syntax. It also uses brackets, but in this case you specify both the rows and the columns to select. Select all of the rows by supplying a lone colon as the first argument, and the column 'color' by supplying it as the second argument (remember that arguments must be separated by a comma).

In [22]:
df.loc[:, 'color']

0      Blue
1      Blue
2       Red
3       Red
4     Green
      ...  
95      Red
96     Blue
97      Red
98      Red
99      Red
Name: color, Length: 100, dtype: object


10. Now specify a slice, `10:13`, for the first argument and a list of columns, `['color', 'floats']`, as the second, to select **four** rows (the upper bound in `loc[]` is included) and two columns.

In [23]:
df.loc[10:13, ['color', 'floats']]

,color,floats
10,Red,0.07292
11,Blue,0.816948
12,Blue,0.63853
13,Green,0.915158


11. Now try the `iloc[]` syntax. This uses the position of rows and columns to determine the selection. In this DataFrame, the labels for the rows are the same as their positions, so we can use the same slice `10:13` as the first argument. For the second, use the slice `0:2` to select the first two columns. Notice that with `iloc[]`, the upper bound is not inclusive, so you will get three rows and two columns.

In [24]:
df.iloc[10:13, 0:2]

,color,integers
10,Red,14
11,Blue,14
12,Blue,11
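
The difference between `loc[]` and `iloc[]` is easier to see when row labels are not the same as positions. The sketch below assumes a hypothetical DataFrame with string labels `'a'` to `'e'`: `loc[]` slices by label and includes the upper bound, while `iloc[]` slices by position and excludes it.

```python
import pandas as pd

# Hypothetical DataFrame with string row labels instead of 0, 1, 2, ...
demo = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
                     'integers': [10, 20, 30, 40, 50]},
                    index=['a', 'b', 'c', 'd', 'e'])

by_label = demo.loc['b':'d', ['color']]  # labels 'b' through 'd', inclusive
by_position = demo.iloc[1:3, 0:1]        # positions 1 and 2; position 3 excluded

print(list(by_label.index))
print(list(by_position.index))
```

Both selections start at the same row, but the label slice returns three rows and the position slice returns two.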
