# Working with data
## Pandas data structures

***
<br>

## Two main data structures used by pandas

* Pandas uses two main data structures: the **Series** and the **DataFrame**.
* A Series is roughly equivalent to a vector or a list.
* The DataFrame is equivalent to a table. Each column in a pandas DataFrame is a pandas Series data structure.
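
To make the last point concrete, here is a minimal sketch that builds a tiny DataFrame and checks that selecting one column gives back a Series. The name `df_demo` and its values are made up for illustration only.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({"name": ["Anna", "John"], "age": [24, 40]})

# Selecting a single column returns a Series
age_column = df_demo["age"]

print(type(df_demo))
print(type(age_column))
print(isinstance(age_column, pd.Series))
```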


## Series data type

* As an analogy, we can compare `Series` objects to an Excel column.
* A `Series` works much like a Python list, but offers more: element-wise arithmetic, built-in statistics and a customizable index.
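
One difference from a plain list is how arithmetic behaves. The sketch below, using made-up values, contrasts multiplying a list (which repeats it) with multiplying a Series (which scales every element).

```python
import pandas as pd

values = [1, 2, 3]  # illustrative values, not taken from any dataset

# Multiplying a list by an integer repeats the list
repeated = values * 2

# Multiplying a Series by an integer multiplies each element
scaled = pd.Series(values) * 2

print(repeated)
print(scaled.tolist())
```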

In [20]:
import pandas as pd

# create a Series object
series1 = pd.Series([-1,1,3,5,7])
print("series1:")
print(series1)

# creation of a new Series object as a result of an operation on an existing object
series2 = series1 * 10
print("series2:")
print(series2)

series1:
0   -1
1    1
2    3
3    5
4    7
dtype: int64
series2:
0   -10
1    10
2    30
3    50
4    70
dtype: int64


#### Examples of operations on Series objects

In [21]:
# absolute value of each element of the Series object
series2.abs()

0    10
1    10
2    30
3    50
4    70
dtype: int64

In [22]:
# generate descriptive statistics
series2.describe()

count     5.000000
mean     30.000000
std      31.622777
min     -10.000000
25%      10.000000
50%      30.000000
75%      50.000000
max      70.000000
dtype: float64

In [23]:
# set the index (identifier) of the Series elements
series2.index = ['First','Second','Third','Fourth','Fifth']
print(series2)

First    -10
Second    10
Third     30
Fourth    50
Fifth     70
dtype: int64
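
Once the index holds labels, elements can be looked up by label as well as by position. The following sketch rebuilds the same values and labels as `series2` so that it runs on its own; the variable name `s` is an assumption made for this example.

```python
import pandas as pd

# Same values and labels as series2 above, rebuilt so the snippet is self-contained
s = pd.Series([-10, 10, 30, 50, 70],
              index=['First', 'Second', 'Third', 'Fourth', 'Fifth'])

# Access a single element by its label
third = s.loc['Third']

# Positional access is still available through .iloc
also_third = s.iloc[2]

# Label-based slices include both endpoints
middle = s.loc['Second':'Fourth']

print(third, also_third)
print(middle)
```

Note that a label slice with `.loc` includes its end label, unlike a positional slice.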


## DataFrame data type

* Where a `Series` corresponds to a single Excel column, a `DataFrame` corresponds to a whole table, i.e. a collection of `Series` that share the same index.

In [24]:
df = pd.read_csv("data/sets.csv")
type(df)

pandas.core.frame.DataFrame

#### Creating a DataFrame

* Pandas lets us create DataFrames in several ways, for example from a list, from a dictionary, or from CSV, XLS and JSON files.

In [25]:
# DataFrame created from a list

my_list = [['Anna',24],['Michael',9],['John',40],['Eve',43]]
df_a = pd.DataFrame(my_list)
df_a.columns = ['First name', 'Age']
df_a

   First name  Age
0        Anna   24
1     Michael    9
2        John   40
3         Eve   43
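
As a possible alternative to assigning `columns` afterwards, the column names can be passed to the constructor directly. This is only a sketch; `df_a2` is a name chosen for this example.

```python
import pandas as pd

my_list = [['Anna', 24], ['Michael', 9], ['John', 40], ['Eve', 43]]

# Pass the column names directly to the DataFrame constructor
df_a2 = pd.DataFrame(my_list, columns=['First name', 'Age'])

print(df_a2)
```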


In [26]:
# DataFrame created from a dictionary
b = {
    'Name': ['Eve','Michael','Christopher','Catherine','Diana'],
    'City': ['Warsaw','Krakow','Gdansk','Poznan','Lodz']
}
df_b = pd.DataFrame(b)
df_b

          Name    City
0          Eve  Warsaw
1      Michael  Krakow
2  Christopher  Gdansk
3    Catherine  Poznan
4        Diana    Lodz


In [27]:
# DataFrame created from a CSV file
df_c = pd.read_csv('https://cdn.rebrickable.com/media/downloads/sets.csv.gz')
df_c

           set_num                             name  year  theme_id  num_parts
0            001-1                            Gears  1965         1         43
1           0011-2                Town Mini-Figures  1979        67         12
2           0011-3       Castle 2 for 1 Bonus Offer  1987       199          0
3           0012-1               Space Mini-Figures  1979       143         12
4           0013-1               Space Mini-Figures  1979       143         12
...            ...                              ...   ...       ...        ...
19758      XWING-1              Mini X-Wing Fighter  2019       158         60
19759      XWING-2                X-Wing Trench Run  2019       158         52
19760  YODACHRON-1  Yoda Chronicles Promotional Set  2013       158        413
19761   YTERRIER-1                Yorkshire Terrier  2018       598          0
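
JSON sources can be read in a similar way with `pd.read_json`. The sketch below uses a small made-up JSON string wrapped in `io.StringIO` instead of a real file, so the content and the name `df_json` are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical JSON content; in practice this would usually come from a file
json_text = '[{"Name": "Eve", "City": "Warsaw"}, {"Name": "Michael", "City": "Krakow"}]'

# Wrapping the string in StringIO lets read_json treat it like a file
df_json = pd.read_json(io.StringIO(json_text))

print(df_json)
```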


#### Elementary operations on a DataFrame

In [28]:
# retrieving the first n (in our case 7) rows
df_c.head(7)

   set_num                        name  year  theme_id  num_parts
0    001-1                       Gears  1965         1         43
1   0011-2           Town Mini-Figures  1979        67         12
2   0011-3  Castle 2 for 1 Bonus Offer  1987       199          0
3   0012-1          Space Mini-Figures  1979       143         12
4   0013-1          Space Mini-Figures  1979       143         12
5   0014-1          Space Mini-Figures  1979       143         12
6   0015-1          Space Mini-Figures  1979       143         18


In [29]:
# retrieving the last n (in our case 4) rows
df_c.tail(4)

           set_num                             name  year  theme_id  num_parts
19759      XWING-2                X-Wing Trench Run  2019       158         52
19760  YODACHRON-1  Yoda Chronicles Promotional Set  2013       158        413
19761   YTERRIER-1                Yorkshire Terrier  2018       598          0
19762     ZX8000-1             ZX 8000 LEGO Sneaker  2020       501          0


In [30]:
# print a concise summary of a DataFrame
df_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19763 entries, 0 to 19762
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   set_num    19763 non-null  object
 1   name       19763 non-null  object
 2   year       19763 non-null  int64 
 3   theme_id   19763 non-null  int64 
 4   num_parts  19763 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 772.1+ KB


In [31]:
# return a tuple representing the dimensionality (number of rows and columns) of the DataFrame
df_c.shape

(19763, 5)

In [32]:
# generate descriptive statistics of numeric type columns
df_c.describe()

              year    theme_id   num_parts
count      19763.0     19763.0     19763.0
mean   2006.691899  410.326367  160.784041
std      13.942504  196.515201  401.145022
min         1949.0         1.0         0.0
25%         2001.0       253.0         5.0
50%         2011.0       494.0        34.0
75%         2017.0       525.0       144.0
max         2022.0       725.0     11695.0
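
By default `describe()` leaves out text columns when numeric ones are present, which is why `set_num` and `name` do not appear above. The sketch below shows, on a small made-up DataFrame (not the real dataset), how `include="all"` brings them in.

```python
import pandas as pd

# Hypothetical toy data standing in for a larger dataset
df_demo = pd.DataFrame({
    "name": ["Gears", "Town Mini-Figures", "Gears"],
    "year": [1965, 1979, 1987],
})

# By default only numeric columns are summarized
numeric_summary = df_demo.describe()

# include="all" also summarizes text columns (count, unique, top, freq)
full_summary = df_demo.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the combined summary, statistics that do not apply to a column (for example `mean` for a text column) are shown as `NaN`.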


## --- Exercise ---

Load the data stored in the `data/FIC.csv` file and analyse its contents: the number of rows and columns, the column types, and the range of values in each column.

In [None]:
# Write your code here