# Hands on: Datasets and Attributes

### Overview

- [1 Create a dataframe](#ch1)

    - [1.1 with DataFrame: example 1](#ch1_1)

    - [1.2 with DataFrame: example 2](#ch1_2)
    
    - [1.3 with read_csv](#ch1_3)
    
        - [1.3.1 reading a csv file](#ch1_3_1)

- [2 Getting an array from a dataframe](#ch2)

- [3 Questions](#ch3)

- [4 DataFrames: some basics](#ch4)

In [2]:
# display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# 1 Creating a dataframe <a name="ch1"></a>

A DataFrame is a two-dimensional table. It contains individual entries, each of which holds a value and is located by a row (or record) and a column.


## 1.1 with DataFrame: example 1 <a name="ch1_1"></a>

Consider the following simple DataFrame:

In [4]:
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'brandA', 'class1'],
                   ['red', 'XL', 13.5, 'brandB', 'class2'],
                   ['blue', 'L', 15.3, 'brandA', 'class1']])
df.columns = ['color', 'size', 'price', 'brand', 'classlabel']
df

Unnamed: 0,color,size,price,brand,classlabel
0,green,M,10.1,brandA,class1
1,red,XL,13.5,brandB,class2
2,blue,L,15.3,brandA,class1


## 1.2 with DataFrame: example 2 <a name="ch1_2"></a>

In [5]:
df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size':  ['M', 'XL', 'L'],
                   'price': [10.1,  13.5,  15.3],
                   'brand': ['brandA', 'brandB', 'brandA'],
                   'classlabel': ['class1', 'class2', 'class1']})
df

Unnamed: 0,color,size,price,brand,classlabel
0,green,M,10.1,brandA,class1
1,red,XL,13.5,brandB,class2
2,blue,L,15.3,brandA,class1


## 1.3 with read_csv <a name="ch1_3"></a>

In [6]:
import pandas as pd
from io import StringIO

csv_data = '''color,size,price,brand,classlabel
green,M,10.1,brandA,class1
red,XL,13.5,brandB,class2
blue,L,15.3,brandA,class1'''

df = pd.read_csv(StringIO(csv_data))
df

Unnamed: 0,color,size,price,brand,classlabel
0,green,M,10.1,brandA,class1
1,red,XL,13.5,brandB,class2
2,blue,L,15.3,brandA,class1


### 1.3.1 reading a csv file <a name="ch1_3_1"></a>

See `help(pd.read_csv)` for the full list of parameters.
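
As a minimal sketch of reading from an actual file instead of a `StringIO` buffer, the example data can be written to disk and read back. The file name `products.csv` and the temporary directory are assumptions made only for this illustration.

```python
import os
import tempfile

import pandas as pd

csv_data = """color,size,price,brand,classlabel
green,M,10.1,brandA,class1
red,XL,13.5,brandB,class2
blue,L,15.3,brandA,class1"""

# Write the example data to a temporary file (hypothetical name: products.csv).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "products.csv")
with open(csv_path, "w") as f:
    f.write(csv_data)

# Read the file back; its first line becomes the column labels.
df_file = pd.read_csv(csv_path)
print(df_file)
```

Passing a path instead of a buffer is the only change with respect to the `StringIO` version.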

## 1.4 About this example

In this example:
- the "0, color" entry has the value 'green';
- the "0, size" entry has the value 'M';
- the "0, price" entry has the value 10.1;
- the class label of entry "0" is 'class1';

and so on.


(Notice that DataFrame entries are not limited to integers.)

`Key` - column names

*Values* - list of entries

`Index` - list of row labels

*By default, the row labels are integers assigned in ascending order (0, 1, 2, 3, ...), i.e. the `index` is in ascending order.*

### Assign values to row labels
It is possible to assign values to the row labels by using the `index` parameter of the constructor:

In [5]:
df_2 = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size':  ['M', 'XL', 'L'],
                   'price': [10.1,  13.5,  15.3],
                   'brand': ['brandA', 'brandB', 'brandA'],
                   'classlabel': ['class1', 'class2', 'class1']}, 
                   index = ['Client_1', 'Client_2', 'Client_3'])
df_2

Unnamed: 0,color,size,price,brand,classlabel
Client_1,green,M,10.1,brandA,class1
Client_2,red,XL,13.5,brandB,class2
Client_3,blue,L,15.3,brandA,class1
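
Once the row labels are set, a row can be reached either by its label or by its integer position. The sketch below uses a reduced, two-column version of `df_2` (an assumption to keep it short) to show that both routes return the same row.

```python
import pandas as pd

df_2 = pd.DataFrame({'color': ['green', 'red', 'blue'],
                     'price': [10.1, 13.5, 15.3]},
                    index=['Client_1', 'Client_2', 'Client_3'])

by_label = df_2.loc['Client_2']   # select the row by its label
by_position = df_2.iloc[1]        # select the same row by its integer position

# Both selections describe the same record.
print(by_label.equals(by_position))
```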


# 2 Getting an array from a dataframe <a name="ch2"></a>

In [10]:
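# Three equivalent ways to select the 'color' and 'size' columns,
# each converted to a NumPy array with the .values attribute
X1 = df[['color', 'size']].values         # by column labels
X1
X2 = df.loc[:,['color', 'size']].values   # by labels, all rows
X2
X3 = df.iloc[:,0:2].values                # by integer position, all rows
X3


# The same three selections converted with the to_numpy() method
Y1 = df[['color', 'size']].to_numpy()
Y1
Y2 = df.loc[:,['color', 'size']].to_numpy()
Y2
Y3 = df.iloc[:,0:2].to_numpy()
Y3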


array([['green', 'M'],
       ['red', 'XL'],
       ['blue', 'L']], dtype=object)

array([['green', 'M'],
       ['red', 'XL'],
       ['blue', 'L']], dtype=object)

array([['green', 'M'],
       ['red', 'XL'],
       ['blue', 'L']], dtype=object)

array([['green', 'M'],
       ['red', 'XL'],
       ['blue', 'L']], dtype=object)

array([['green', 'M'],
       ['red', 'XL'],
       ['blue', 'L']], dtype=object)

array([['green', 'M'],
       ['red', 'XL'],
       ['blue', 'L']], dtype=object)
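
The arrays above have `dtype=object` because the selected columns hold strings. As a small sketch on a reduced copy of the example data, the dtype of the resulting array depends on which columns are selected:

```python
import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': ['M', 'XL', 'L'],
                   'price': [10.1, 13.5, 15.3]})

mixed = df[['color', 'price']].to_numpy()   # strings and floats together
numeric = df[['price']].to_numpy()          # a single numeric column

print(mixed.dtype)
print(numeric.dtype, numeric.shape)
```

Selecting with a list of column names (`[['price']]`) keeps the result two-dimensional, which is the layout many numerical libraries expect for a feature matrix.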



# 3 Questions <a name="ch3"></a>

**1) Identify the `keys` in dataframe.**

**2) With which concepts do you associate `key` and `index`?**

**3) Identify the dimensionality and size.**  *Do you notice any particularity in this dataframe?*
The dimensionality is 4 (the number of attributes) and the size is 3 (the number of instances). There are more attributes than instances.

**4) Identify the type and scale of attributes. Recall the properties of each.**
- color is categorical and nominal / distinctness
- size is categorical and ordinal / distinctness and order
- brand is categorical and nominal / distinctness
- price is numeric and quantitative / all properties
- class label is binary and nominal / distinctness
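
The difference between a nominal and an ordinal attribute can be made explicit in pandas. The following is only a sketch: the ordering M < L < XL is an assumption about the sizes, and `size_order` is a name introduced for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': ['M', 'XL', 'L']})

# Assumed ordering of the sizes: M < L < XL.
size_order = ['M', 'L', 'XL']
df['size'] = pd.Categorical(df['size'], categories=size_order, ordered=True)

# With an ordered categorical, order-based operations become meaningful.
largest = df['size'].max()
above_m = df[df['size'] > 'M']

print(largest)
print(above_m)
```

A nominal attribute such as `color` would stay an unordered column, for which only equality comparisons make sense.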

# 4 DataFrames: some basics <a name="ch4"></a>

#### Get the first n rows of a DataFrame

In [13]:
df.head(n=2)  # default is n=5: df.head()

Unnamed: 0,color,size,price,brand,classlabel
0,green,M,10.1,brandA,class1
1,red,XL,13.5,brandB,class2


#### Get the last n rows of a DataFrame

In [8]:
df.tail(n=2)  # default is n=5: df.tail()

Unnamed: 0,color,size,price,brand,classlabel
1,red,XL,13.5,brandB,class2
2,blue,L,15.3,brandA,class1


####  Get the data type of each column

In [9]:
df.dtypes

color          object
size           object
price         float64
brand          object
classlabel     object
dtype: object
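
The data types can also be used to split the columns into numeric and non-numeric ones. A minimal sketch with `select_dtypes`, rebuilding the example DataFrame so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': ['M', 'XL', 'L'],
                   'price': [10.1, 13.5, 15.3],
                   'brand': ['brandA', 'brandB', 'brandA'],
                   'classlabel': ['class1', 'class2', 'class1']})

# Columns whose dtype is numeric, and all the remaining ones.
numeric_cols = df.select_dtypes(include='number').columns.tolist()
other_cols = df.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(other_cols)
```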

#### Get the keys of a DataFrame: with `keys()`

In [10]:
df.keys()

Index(['color', 'size', 'price', 'brand', 'classlabel'], dtype='object')

#### Get the keys of a DataFrame: with `.columns`

In [11]:
df.columns

Index(['color', 'size', 'price', 'brand', 'classlabel'], dtype='object')

#### Get the index (row labels) of a DataFrame

In [12]:
df.index
df_2.index

RangeIndex(start=0, stop=3, step=1)

Index(['Client_1', 'Client_2', 'Client_3'], dtype='object')

#### Get the values of attribute 'color' for all objects: using attribute access (`.`)

In [13]:
df.color

0    green
1      red
2     blue
Name: color, dtype: object

#### Get the values of attribute 'color' for all objects: by indexing with the `key`

In [14]:
df['color']

0    green
1      red
2     blue
Name: color, dtype: object

#### Get the values of attribute 'color' for all objects: using `iloc`
(it works by integer position)

*Indexing both axes*

In [15]:
df.iloc[:,0]

0    green
1      red
2     blue
Name: color, dtype: object

#### Get the values of attribute 'color' for all objects: using `loc`
(it works by labels/names)

*Indexing both axes*

In [16]:
df.loc[:,'color']

0    green
1      red
2     blue
Name: color, dtype: object
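
When both axes are indexed, `loc` and `iloc` treat slices differently: a label slice includes its end point, a position slice does not. A short sketch on a reduced copy of the example data:

```python
import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': ['M', 'XL', 'L'],
                   'price': [10.1, 13.5, 15.3]})

by_label = df.loc[0:1, ['color', 'price']]   # label slice: rows 0 and 1, end included
by_position = df.iloc[0:2, [0, 2]]           # position slice: rows 0 and 1, end excluded

print(by_label.equals(by_position))
```

Here the row labels happen to coincide with the positions; with labels such as `Client_1` the two slices would have to be written differently.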

#### Get value at specified row/column pair

In [17]:
df.at[2,'color']

'blue'

#### Set value at specified row/column pair

In [18]:
df.at[2, 'color'] = 'black'

#### Get a Series for an object

In [19]:
df.loc[2]

color          black
size               L
price           15.3
brand         brandA
classlabel    class1
Name: 2, dtype: object

#### Get value within a Series: with `at`

In [20]:
df.loc[2].at['color']
df.loc[2].at['size']

'black'

'L'

#### Get value within a Series: with `iat`

In [21]:
df.loc[2].iat[0]
df.loc[2].iat[1]

'black'

'L'
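
Going through `df.loc[2]` first builds a Series for the whole row. As a sketch of a more direct route, `at` and `iat` can also be applied to the DataFrame itself, with a row and a column given together (the DataFrame is rebuilt here, so the value is still 'blue'):

```python
import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': ['M', 'XL', 'L']})

via_series = df.loc[2].at['color']   # row Series first, then the lookup
direct_label = df.at[2, 'color']     # row label and column label
direct_position = df.iat[2, 0]       # row position and column position

print(via_series, direct_label, direct_position)
```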

#### Get the shape of a DataFrame

In [22]:
df
df.shape

Unnamed: 0,color,size,price,brand,classlabel
0,green,M,10.1,brandA,class1
1,red,XL,13.5,brandB,class2
2,black,L,15.3,brandA,class1


(3, 5)

#### Get the number of objects of a DataFrame

In [23]:
len(df.index)

3

#### Get the number of attributes of a DataFrame

In [24]:
len(df.columns)

5
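
Both counts can also be read from `shape` in one step. A minimal sketch, with `n_objects` and `n_attributes` as names chosen for this illustration:

```python
import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': ['M', 'XL', 'L'],
                   'price': [10.1, 13.5, 15.3],
                   'brand': ['brandA', 'brandB', 'brandA'],
                   'classlabel': ['class1', 'class2', 'class1']})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_objects, n_attributes = df.shape
print(n_objects, n_attributes)
```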

#### Drop a DataFrame column

In [26]:
df = df.drop(['brand'], axis=1)
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class1
1,red,XL,13.5,class2
2,black,L,15.3,class1
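
`drop` returns a new DataFrame and leaves the original untouched unless the result is assigned back, as done above. The sketch below uses the `columns=` keyword, which is equivalent to passing `axis=1`; `df_no_brand` is a name introduced for the illustration.

```python
import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'price': [10.1, 13.5, 15.3],
                   'brand': ['brandA', 'brandB', 'brandA']})

# Equivalent to df.drop(['brand'], axis=1); the result is a new object.
df_no_brand = df.drop(columns=['brand'])

print(df.columns.tolist())           # the original still has 'brand'
print(df_no_brand.columns.tolist())
```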


#### Concatenate pandas objects along a particular axis

In [27]:
df_3 = df.copy()
df_4 = pd.DataFrame({'brand': ['brandA', 'brandB', 'brandA']})
df_5 = pd.concat([df_3, df_4], axis=1)
df_5

Unnamed: 0,color,size,price,classlabel,brand
0,green,M,10.1,class1,brandA
1,red,XL,13.5,class2,brandB
2,black,L,15.3,class1,brandA
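
With `axis=1`, `concat` matches rows by their index labels, not by their position. The example above works because both DataFrames share the default index 0, 1, 2. A sketch of what happens otherwise, with `left` and `right` as hypothetical frames:

```python
import pandas as pd

left = pd.DataFrame({'color': ['green', 'red', 'blue']})
# Same number of rows, but different row labels (1, 2, 3 instead of 0, 1, 2).
right = pd.DataFrame({'brand': ['brandA', 'brandB', 'brandA']}, index=[1, 2, 3])

# Rows are aligned on the labels; labels missing on one side produce NaN.
aligned = pd.concat([left, right], axis=1)

# Resetting the index first makes the rows line up by position.
by_position = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(aligned)
print(by_position)
```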


#### Insert column into DataFrame at specified location

In [28]:
df.insert(3, "brand", ['brandA', 'brandB', 'brandA'])
df


Unnamed: 0,color,size,price,brand,classlabel
0,green,M,10.1,brandA,class1
1,red,XL,13.5,brandB,class2
2,black,L,15.3,brandA,class1


In [14]:
df.insert(3, "brand", ['brandA', 'brandB', 'brandA'])
df
# raises an error because the 'brand' column already exists

ValueError: cannot insert brand, already exists
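
One way to make the cell safe to run more than once is to check for the column before inserting it. This is only a sketch of that guard; the loop merely simulates running the cell twice.

```python
import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'blue'],
                   'size': ['M', 'XL', 'L'],
                   'price': [10.1, 13.5, 15.3],
                   'classlabel': ['class1', 'class2', 'class1']})
brands = ['brandA', 'brandB', 'brandA']

# Simulate executing the same cell twice.
for _ in range(2):
    # Insert only when the column is not there yet, so no ValueError is raised.
    if 'brand' not in df.columns:
        df.insert(3, 'brand', brands)

print(df.columns.tolist())
```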

#### Check whether a DataFrame is empty (has no entries)

In [29]:
df.empty

False
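
Note that `empty` only reports whether the DataFrame has no elements at all; it says nothing about missing values inside it. A sketch of the difference, using a hypothetical `df_missing` with two missing entries:

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({'color': ['green', None, 'blue'],
                           'price': [10.1, 13.5, np.nan]})

is_empty = df_missing.empty                    # does the frame have no elements?
has_missing = df_missing.isna().any().any()    # is any entry missing?
missing_per_column = df_missing.isna().sum()   # count of missing entries per column

print(is_empty, has_missing)
print(missing_per_column)
print(pd.DataFrame().empty)                    # a DataFrame with no rows and no columns
```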