In [None]:
---
title: "Data Manipulation"
execute:
    echo: true
    eval: true
---

# Data Manipulation {.unnumbered}

### Array manipulation {.unnumbered}

Data often has to be manipulated before it can be analyzed. <br> 
NumPy provides many functions for array manipulation. <br> 
For example, the shape of an array can be changed, multiple arrays can be concatenated, <br> 
the elements of an array can be sorted, invalid values (NaN) can be removed, <br> linear algebra operations can be performed, etc.
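
Removing invalid values is mentioned above but not shown in the cells that follow. A minimal sketch, assuming a hypothetical array `values` in which `np.nan` marks the invalid entries:

```python
import numpy as np

# Hypothetical measurements; np.nan marks an invalid value.
values = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])

# Boolean mask that is True where the entry is a valid number.
valid = ~np.isnan(values)

# Boolean indexing keeps only the valid entries.
cleaned = values[valid]
print("cleaned: ", cleaned)

# Alternatively, the nan-aware functions skip NaN without removing it first.
mean_ignoring_nan = np.nanmean(values)
print("mean ignoring NaN: ", mean_ignoring_nan)
```

Masking changes the length of the array, so it is mainly useful for 1-D data; for tables, dropping whole rows in pandas is usually the better fit.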

#### Sort arrays: {.unnumbered}

In [None]:

import numpy as np

unsorted_arr = np.array([[3, 1, 5, 2, 4], [5, 2, 0, 8, 1], [3, 2, 9, 4, 5]])
print("unsorted_arr: \n", unsorted_arr)
sorted_arr = np.sort(unsorted_arr, axis=0) # sort each column (along axis 0)
print("sorted_arr along columns: \n", sorted_arr)
sorted_arr = np.sort(unsorted_arr, axis=1) # sort each row (along axis 1)
print("sorted_arr along rows: \n", sorted_arr)

unsorted_arr: 
 [[3 1 5 2 4]
 [5 2 0 8 1]
 [3 2 9 4 5]]
sorted_arr along columns: 
 [[3 1 0 2 1]
 [3 2 5 4 4]
 [5 2 9 8 5]]
sorted_arr along rows: 
 [[1 2 3 4 5]
 [0 1 2 5 8]
 [2 3 4 5 9]]


#### Concatenate arrays: {.unnumbered}

In [None]:
a = np.array([1, 2, 3]) 
b = np.array([4, 5, 6])
c = np.concatenate((a, b))
print(c)

[1 2 3 4 5 6]
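
For arrays with more than one dimension, `np.concatenate` takes an `axis` argument, and the arrays must agree in every dimension except the one being joined. A small sketch with hypothetical arrays `a` and `b`:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])   # shape (2, 2)
b = np.array([[5, 6]])           # shape (1, 2)

# axis=0 appends b as a new row; the number of columns must match.
rows = np.concatenate((a, b), axis=0)
print("joined along axis 0: \n", rows)

# axis=1 appends columns; b.T has shape (2, 1), so the number of rows matches.
cols = np.concatenate((a, b.T), axis=1)
print("joined along axis 1: \n", cols)
```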


#### Stack and flatten arrays: {.unnumbered}

In [None]:
a = np.array([[1, 2],[3, 4],[5, 6]])
b = np.array([[7, 8],[9, 10],[11, 12]])
c = np.vstack((a, b)) # vertical stack
d = np.hstack((a, b)) # horizontal stack
print("vertical stack: \n", c)
print("horizontal stack: \n", d)
print("flatten array: \n", c.flatten())

vertical stack: 
 [[ 1  2]
 [ 3  4]
 [ 5  6]
 [ 7  8]
 [ 9 10]
 [11 12]]
horizontal stack: 
 [[ 1  2  7  8]
 [ 3  4  9 10]
 [ 5  6 11 12]]
flatten array: 
 [ 1  2  3  4  5  6  7  8  9 10 11 12]
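
Flattening is one special case of changing an array's shape. The general tool is `reshape`, sketched here on a hypothetical array `arr`; the total number of elements has to stay the same:

```python
import numpy as np

arr = np.arange(1, 13)          # the integers 1 to 12

# Arrange the 12 elements as 3 rows and 4 columns.
m = arr.reshape(3, 4)
print("3 x 4: \n", m)

# -1 lets NumPy infer that dimension from the total size (here 6).
n = arr.reshape(2, -1)
print("2 x 6: \n", n)

# ravel() flattens like flatten(), but returns a view when possible
# instead of always copying.
flat = m.ravel()
print("flattened again: \n", flat)
```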


#### Arithmetic, linear algebra, and statistics functions, *e.g.*: {.unnumbered}


- ``np.sum(a)`` returns the **sum of all elements** of an array
- ``np.add(a,b)`` returns the **element-wise sum** of two arrays
- ``np.subtract(a,b)`` returns the **element-wise difference** of two arrays
- ``np.multiply(a,b)`` returns the **element-wise product** of two arrays
- ``np.divide(a,b)`` returns the **element-wise quotient** of two arrays
- ``np.dot(a,b)`` returns the **dot product** of two arrays
- ``np.cross(a,b)`` returns the **cross product** of two arrays
- ``np.linalg.inv(a)`` returns the **inverse of a matrix**
- ``np.linalg.det(a)`` returns the **determinant of a matrix**
- ``np.linalg.eig(a)`` returns the **eigenvalues and eigenvectors** of a matrix
- ``np.linalg.solve(a,b)`` returns the **solution of a linear system of equations**
- ``np.mean(a)`` returns the **mean** of the elements of an array
- ``np.average(a)`` returns the **average** of the elements of an array, **weighted** if ``weights`` is given
- ``np.max(a)`` returns the **maximum** of the elements of an array
- ``np.min(a)`` returns the **minimum** of the elements of an array
- ``np.std(a)`` returns the **standard deviation** of the elements of an array
- ``np.var(a)`` returns the **variance** of the elements of an array
- ``np.cov(a)`` returns the **covariance matrix** of the variables in an array (by default each row is one variable)
- ``np.median(a)`` returns the **median** of the elements of an array
- ``np.percentile(a,p)`` returns the **p-th percentile** of the elements of an array
- ``np.histogram(a[, bins, range, density, weights])`` returns the **histogram** <br>
     of the elements of an array (counts and bin edges)


In [None]:
a = np.array([[1, 2],[3, 4],[5, 6]])
b = np.array([[7, 8],[9, 10],[11, 12]])
c = np.sum(a) # sum of all elements
print("Sum of all elements: ", c)
c = np.add(a,b) # element-wise addition
print("Element-wise addition: ", c)

Sum of all elements:  21
Element-wise addition:  [[ 8 10]
 [12 14]
 [16 18]]
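
The cell above covers only `np.sum` and `np.add`. As a rough sketch of some of the other functions in the list, using small hypothetical inputs (`A`, `rhs`, `u`, `v`, `sample` are illustration names, not from the text):

```python
import numpy as np

# --- linear algebra -------------------------------------------------
A = np.array([[2.0, 1.0], [1.0, 3.0]])
rhs = np.array([3.0, 5.0])

x = np.linalg.solve(A, rhs)      # solves A @ x = rhs
det = np.linalg.det(A)           # determinant of A
A_inv = np.linalg.inv(A)         # inverse of A
eigvals, eigvecs = np.linalg.eig(A)
print("solution: ", x, " determinant: ", det)

u = np.array([1, 2, 3])
v = np.array([4, 5, 6])
dot = np.dot(u, v)               # scalar product of two vectors
cross = np.cross(u, v)           # vector product of two 3-D vectors
print("dot: ", dot, " cross: ", cross)

# --- statistics -----------------------------------------------------
sample = np.array([2, 4, 4, 4, 5, 5, 7, 9])
mean = np.mean(sample)
std = np.std(sample)             # population standard deviation (ddof=0)
var = np.var(sample)
median = np.median(sample)
p90 = np.percentile(sample, 90)
weighted = np.average([1, 2, 3], weights=[3, 1, 1])
print("mean: ", mean, " std: ", std, " median: ", median)
```

Note that `np.std` and `np.var` divide by `N` by default; pass `ddof=1` for the sample estimates that divide by `N - 1`.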


- ``np.argmin(a)`` returns the **index of the minimum** element of an array
- ``np.argmax(a)`` returns the **index of the maximum** element of an array
- ``np.where(a)`` returns the **indices of the non-zero elements** of an array <br> 
    as a tuple of index arrays, one per dimension (same result as ``np.nonzero(a)``)
- ``np.argwhere(a)`` returns the **indices of the non-zero elements** of an array <br> 
    as a 2-D array with one row per element
- ``np.nonzero(a)`` returns the **indices of the non-zero elements** of an array <br> 
    as a tuple of index arrays, one per dimension
- ``np.searchsorted(a,v)`` returns the **index at which a value would be inserted** into a sorted array <br> 
    to keep it sorted; by default this is the index of the first element **that is greater than or equal to the value**
- ``np.extract(condition,a)`` returns the **elements of an array that satisfy a condition**

In [None]:
arr = np.array([1, 2, 3, 4, 5, 4, 7, 8, 9])
print(arr)
print("Index of the maximum element: ", np.argmax(arr))
print("Index of the minimum element: ", np.argmin(arr))
print("Maximum element: ", np.max(arr))
print("Minimum element: ", np.min(arr))
print("Find the index of element \"4\": ", np.where(arr == 4))
print("Find the index of element \">4\": ", np.argwhere(arr > 4),
       " and the elements are: ", arr[np.where(arr > 4)])


[1 2 3 4 5 4 7 8 9]
Index of the maximum element:  8
Index of the minimum element:  0
Maximum element:  9
Minimum element:  1
Find the index of element "4":  (array([3, 5]),)
Find the index of element ">4":  [[4]
 [6]
 [7]
 [8]]  and the elements are:  [5 7 8 9]
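
`np.searchsorted`, `np.extract`, and the difference between `np.nonzero` and `np.argwhere` are not covered by the cell above. A minimal sketch with hypothetical inputs:

```python
import numpy as np

# searchsorted expects an array that is already sorted.
sorted_arr = np.array([1, 3, 5, 7, 9])
idx = np.searchsorted(sorted_arr, 6)                     # 6 would go before 7
idx_left = np.searchsorted(sorted_arr, 5)                # default side="left"
idx_right = np.searchsorted(sorted_arr, 5, side="right") # after equal values
print("insert positions: ", idx, idx_left, idx_right)

# extract returns the elements for which the condition is True.
arr = np.array([1, 2, 3, 4, 5, 4, 7, 8, 9])
large = np.extract(arr > 4, arr)
print("elements > 4: ", large)

# nonzero and argwhere report the same positions in different layouts.
m = np.array([[0, 2], [3, 0]])
rows, cols = np.nonzero(m)       # one index array per dimension
pairs = np.argwhere(m)           # one (row, column) pair per element
print("nonzero: ", rows, cols)
print("argwhere: \n", pairs)
```

The tuple returned by `np.nonzero` (and by `np.where` with one argument) can be used directly for indexing, as in `m[rows, cols]`; the `np.argwhere` layout is easier to loop over.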


### Get information about the data {.unnumbered}
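
The cells in this part use a pandas DataFrame named `data` whose construction is not shown here. The following sketch reproduces the values visible in the printed output; building it from a dictionary is an assumption, the original may have been loaded from a file instead:

```python
import pandas as pd

# Assumed construction of `data`; the values are taken from the printed output.
data = pd.DataFrame({
    "name": ["John", "Anna", "Peter", "Linda", "Maria"],
    "age": [23, 36, 45, 32, 60],
    "job": ["student", "teacher", "student", "student", "teacher"],
    "city": ["New York", "Paris", "Berlin", "London", "Paris"],
})
print(data)
```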

#### Select subsets of the data: {.unnumbered}

In [None]:
print("head of data: \n", data.head()) # print the first 5 rows
print("\n")
print("Number of rows: \n", data.shape[0]) # number of rows (records)
print("\n")
print("Column name: \n", data["name"]) # print the column "name"
print("\n")
print("Second row: \n", data.iloc[1]) # print the second row

head of data: 
     name  age      job      city
0   John   23  student  New York
1   Anna   36  teacher     Paris
2  Peter   45  student    Berlin
3  Linda   32  student    London
4  Maria   60  teacher     Paris


Number of rows: 
 5


Column name: 
 0     John
1     Anna
2    Peter
3    Linda
4    Maria
Name: name, dtype: object


Second row: 
 name       Anna
age          36
job     teacher
city      Paris
Name: 1, dtype: object


In [None]:
# filter the rows where age is greater than 30
print("Filter the rows where age is greater than 30: \n", data[data["age"] > 30]) 
print("\n")
# get the city of the people who are older than 40
print("Get the city of the people who are older than 40: \n", 
      data.loc[data["age"] > 40, "city"])
print("\n")

Filter the rows where age is greater than 30: 
     name  age      job    city
1   Anna   36  teacher   Paris
2  Peter   45  student  Berlin
3  Linda   32  student  London
4  Maria   60  teacher   Paris


Get the city of the people who are older than 40: 
 2    Berlin
4     Paris
Name: city, dtype: object




### Manipulate data {.unnumbered}

You can copy data, drop columns or rows, fill invalid values, replace values, merge or join data, etc.
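
The cells in this part demonstrate copying, concatenating, and dropping duplicates or NaNs. For the remaining operations in the list, here is a minimal sketch on two hypothetical tables, `people` and `countries`, that are not part of the original data:

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing age.
people = pd.DataFrame({
    "name": ["John", "Anna", "Peter"],
    "age": [23, np.nan, 45],
    "city": ["New York", "Paris", "Berlin"],
})

# Fill the missing age with the mean of the known ages.
filled = people.fillna({"age": people["age"].mean()})

# Replace one value by another wherever it occurs.
replaced = filled.replace("New York", "NYC")

# Drop a column, or drop a row by its index label.
without_city = filled.drop(columns=["city"])
without_first = filled.drop(index=0)

# Merge with a second table on the shared "city" column.
countries = pd.DataFrame({
    "city": ["Paris", "Berlin", "New York"],
    "country": ["France", "Germany", "USA"],
})
merged = filled.merge(countries, on="city", how="left")
print(merged)
```

All of these calls return a new DataFrame and leave `people` unchanged, which is why each result is assigned to its own name.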

In [None]:
data2 = data.copy()
data2["age"] = data2["age"] + 10 # increase the age by 10
print("data2: \n" , data2)

data2.index = range(5,10) # change the index to 5..9
data3 = pd.concat([data, data2]) # concatenate the dataframes row-wise
print("concatenated data: \n", data3)

data2: 
     name  age      job      city
0   John   33  student  New York
1   Anna   46  teacher     Paris
2  Peter   55  student    Berlin
3  Linda   42  student    London
4  Maria   70  teacher     Paris
concatenated data: 
     name  age      job      city
0   John   23  student  New York
1   Anna   36  teacher     Paris
2  Peter   45  student    Berlin
3  Linda   32  student    London
4  Maria   60  teacher     Paris
5   John   33  student  New York
6   Anna   46  teacher     Paris
7  Peter   55  student    Berlin
8  Linda   42  student    London
9  Maria   70  teacher     Paris


In [None]:
# create a pandas Series
hobby = pd.Series(["football", "chess", "swimming", "reading",
                    "dancing", "skiing", "reading", "football", 
                    "cooking", "swimming", "chess"], name="hobby")
print("hobby: \n", hobby)
# concatenate the dataframe and the Series column-wise (rows are aligned on the index)
data4 = pd.concat([data3, hobby], axis=1) 
print("concatenated data: \n", data4)

hobby: 
 0     football
1        chess
2     swimming
3      reading
4      dancing
5       skiing
6      reading
7     football
8      cooking
9     swimming
10       chess
Name: hobby, dtype: object
concatenated data: 
      name   age      job      city     hobby
0    John  23.0  student  New York  football
1    Anna  36.0  teacher     Paris     chess
2   Peter  45.0  student    Berlin  swimming
3   Linda  32.0  student    London   reading
4   Maria  60.0  teacher     Paris   dancing
5    John  33.0  student  New York    skiing
6    Anna  46.0  teacher     Paris   reading
7   Peter  55.0  student    Berlin  football
8   Linda  42.0  student    London   cooking
9   Maria  70.0  teacher     Paris  swimming
10    NaN   NaN      NaN       NaN     chess


In [None]:
data5 = data4.drop_duplicates() # drop rows that are exact duplicates (none here)
print("data5: \n", data5)
data6 = data5.dropna() # drop rows that contain NaN
print("data6: \n", data6)

data5: 
      name   age      job      city     hobby
0    John  23.0  student  New York  football
1    Anna  36.0  teacher     Paris     chess
2   Peter  45.0  student    Berlin  swimming
3   Linda  32.0  student    London   reading
4   Maria  60.0  teacher     Paris   dancing
5    John  33.0  student  New York    skiing
6    Anna  46.0  teacher     Paris   reading
7   Peter  55.0  student    Berlin  football
8   Linda  42.0  student    London   cooking
9   Maria  70.0  teacher     Paris  swimming
10    NaN   NaN      NaN       NaN     chess
data6: 
     name   age      job      city     hobby
0   John  23.0  student  New York  football
1   Anna  36.0  teacher     Paris     chess
2  Peter  45.0  student    Berlin  swimming
3  Linda  32.0  student    London   reading
4  Maria  60.0  teacher     Paris   dancing
5   John  33.0  student  New York    skiing
6   Anna  46.0  teacher     Paris   reading
7  Peter  55.0  student    Berlin  football
8  Linda  42.0  student    London   cooking
9  Maria  70.0  teacher     Paris  swimming
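
In the output above, `data5` is identical to `data4` because `drop_duplicates()` only removes rows that match in every column, and no two rows do. A small sketch on a hypothetical table shows the effect when duplicates exist, and how `subset` restricts the comparison to chosen columns:

```python
import pandas as pd

# Hypothetical table: row 2 repeats row 0 exactly, row 3 repeats only the name.
df = pd.DataFrame({
    "name": ["John", "Anna", "John", "Anna"],
    "age": [23, 36, 23, 46],
})

# Only the exact duplicate (row 2) is removed.
unique_rows = df.drop_duplicates()
print("unique rows: \n", unique_rows)

# Compare on "name" only: the first occurrence of each name is kept.
unique_names = df.drop_duplicates(subset="name")
print("unique names: \n", unique_names)

# dropna accepts subset as well, to check only selected columns for NaN.
partial = pd.DataFrame({"name": ["John", None], "age": [23.0, 36.0]})
named_only = partial.dropna(subset=["name"])
print("rows with a name: \n", named_only)
```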

### Convert a pandas DataFrame to a NumPy array and vice versa {.unnumbered}
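
One detail worth knowing before converting: a NumPy array has a single data type, so a DataFrame that mixes text and numbers becomes an array of generic Python objects. A minimal sketch on a hypothetical two-column table (the names `small`, `mixed`, `ages`, and `back` are for illustration only):

```python
import pandas as pd

small = pd.DataFrame({"name": ["John", "Anna"], "age": [23.0, 36.0]})

# Mixed text and numbers: the resulting array has dtype object.
mixed = small.to_numpy()
print("mixed dtype: ", mixed.dtype)

# Selecting only the numeric column gives a float array.
ages = small[["age"]].to_numpy()
print("numeric dtype: ", ages.dtype, " shape: ", ages.shape)

# Converting back from an object array yields object columns,
# so numeric columns have to be cast explicitly.
back = pd.DataFrame(mixed, columns=small.columns)
back["age"] = back["age"].astype(float)
print(back.dtypes)
```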

In [None]:
df = data6
print(df)
# convert the dataframe to a numpy array
arr = df.to_numpy() 
print("arr: \n", arr)
# convert the numpy array to a dataframe
df2 = pd.DataFrame(arr, columns=["name", "age", "job", "city", "hobby"]) 
print("df2: \n", df2)

    name   age      job      city     hobby
0   John  23.0  student  New York  football
1   Anna  36.0  teacher     Paris     chess
2  Peter  45.0  student    Berlin  swimming
3  Linda  32.0  student    London   reading
4  Maria  60.0  teacher     Paris   dancing
5   John  33.0  student  New York    skiing
6   Anna  46.0  teacher     Paris   reading
7  Peter  55.0  student    Berlin  football
8  Linda  42.0  student    London   cooking
9  Maria  70.0  teacher     Paris  swimming
arr: 
 [['John' 23.0 'student' 'New York' 'football']
 ['Anna' 36.0 'teacher' 'Paris' 'chess']
 ['Peter' 45.0 'student' 'Berlin' 'swimming']
 ['Linda' 32.0 'student' 'London' 'reading']
 ['Maria' 60.0 'teacher' 'Paris' 'dancing']
 ['John' 33.0 'student' 'New York' 'skiing']
 ['Anna' 46.0 'teacher' 'Paris' 'reading']
 ['Peter' 55.0 'student' 'Berlin' 'football']
 ['Linda' 42.0 'student' 'London' 'cooking']
 ['Maria' 70.0 'teacher' 'Paris' 'swimming']]
df2: 
     name   age      job      city     hobby
0   John  