# The Basics of NumPy and Pandas

**Numpy** is the core library for scientific computing in Python. It provides a high-performance multidimensional **array object**, and tools for working with these arrays. If you are already familiar with MATLAB, you might find this [tutorial](http://wiki.scipy.org/NumPy_for_Matlab_Users) useful to get started with Numpy.

### Important: Why use NumPy Arrays instead of Lists?


*   Data types: Arrays in NumPy are **homogeneous**, meaning all elements must be of the same data type (e.g., integers, floats), whereas lists can contain elements of different data types.

*   Memory efficiency: Arrays are more **memory efficient** compared to lists because they store data in a contiguous block of memory. Lists, on the other hand, store references to objects in memory, which can result in more memory overhead.

*   Performance: Operations on arrays are generally faster and more efficient than operations on lists, especially for large datasets, due to **NumPy's implementation in C** and optimized algorithms.

*   Functionality: NumPy arrays come with a wide range of **built-in functions and methods** for mathematical operations, linear algebra, statistical analysis, and more. Lists have a more limited set of built-in functions and methods.
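The points above can be illustrated with a small comparison. This is only a minimal sketch; the names `values`, `arr`, `doubled_list`, `doubled_arr`, and `mixed` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

values = [1, 2, 3, 4]
arr = np.array(values)

# Lists need an explicit loop (or comprehension) for elementwise math
doubled_list = [v * 2 for v in values]

# Arrays apply the operation to every element at once
doubled_arr = arr * 2

# Mixing ints and floats yields one common dtype for the whole array
mixed = np.array([1, 2.5, 3])

print(doubled_list)   # [2, 4, 6, 8]
print(doubled_arr)    # [2 4 6 8]
print(mixed.dtype)    # float64
```

The list version and the array version give the same numbers; the array version avoids the Python-level loop and keeps a single dtype for all elements.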




To use NumPy, we first need to import the `numpy` package. By convention, we import it under the alias `np`. Then, when we want to use functions or submodules from this library, we prefix them with `np.`.



In [3]:
import numpy as np

### Arrays and array construction

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The number of dimensions is the rank of the array; the shape of an array is a tuple of integers giving the size of the array along each dimension.

We can create a `numpy` array by passing a Python list to `np.array()`.

In [2]:
a = np.array([1, 2, 3])  # Create a rank 1 array

This creates the array we can see on the right here:

![](http://jalammar.github.io/images/numpy/create-numpy-array-1.png)

In [3]:
print(type(a), a.shape, a[0], a[1], a[2])
a[0] = 5                 # Change an element of the array
print(a)

<class 'numpy.ndarray'> (3,) 1 2 3
[5 2 3]


To create a `numpy` array with more dimensions, we can pass nested lists, like this:

![](http://jalammar.github.io/images/numpy/numpy-array-create-2d.png)

![](http://jalammar.github.io/images/numpy/numpy-3d-array.png)

In [18]:
b = np.array([[1,2],[3,4]])   # Create a rank 2 array
print(b)
print(b.shape)

[[1 2]
 [3 4]]
(2, 2)


In [23]:
c=np.array([[[1,2],[3,4]],[[5,6],[7,8]]])
print(c)
print(c.shape)
# Important: c is a rank 3 array

[[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
(2, 2, 2)


**Important:** There are often cases when we want NumPy to initialize the values of the array for us. NumPy provides functions like `ones()`, `zeros()`, and `random.random()` for these cases. We just pass them the number of elements we want them to generate:

![](http://jalammar.github.io/images/numpy/create-numpy-array-ones-zeros-random.png)

Sometimes, we need an array of a specific shape with “placeholder” values that we plan to fill in with the result of a computation. The `zeros` or `ones` functions are handy for this:

In [16]:
a = np.ones(3)  # Create an array of all ones
print(a)
b=np.random.random(3)
print(b)

[1. 1. 1.]
[0.2001977  0.87264582 0.63624624]


We can also use these methods to produce multi-dimensional arrays, as long as we pass them a tuple describing the dimensions of the matrix we want to create:

![](http://jalammar.github.io/images/numpy/numpy-matrix-ones-zeros-random.png)



In [17]:
a = np.zeros((2,2))  # Create an array of all zeros
print(a)
b = np.ones((1,2))   # Create an array of all ones
print(b)
c = np.random.random((2,2)) # Create an array filled with random values
print(c)

[[0. 0.]
 [0. 0.]]
[[1. 1.]]
[[0.33328177 0.56352009]
 [0.18909052 0.17994064]]


**Important:** NumPy also has two useful functions for creating sequences of numbers: `arange` and `linspace`.

The `arange` function accepts three arguments, which define the start value, stop value of a half-open interval, and step size. (The default step size, if not explicitly specified, is 1; the default start value, if not explicitly specified, is 0.)

The `linspace` function is similar, but we can specify the number of values instead of the step size, and it will create a sequence of evenly spaced values.

In [None]:
f = np.arange(10,50,5)   # Create an array of values starting at 10 in increments of 5
print(f)

Note this ends on 45, not 50 (does not include the top end of the interval).

In [18]:
g = np.linspace(0., 1., num=5)
print(g)

[0.   0.25 0.5  0.75 1.  ]


In [24]:
# Using linspace
linspace_array = np.linspace(0, 10, 5)
print("Linspace Array:", linspace_array)

# Using arange
arange_array = np.arange(0, 10, 2)
print("Arange Array:", arange_array)

Linspace Array: [ 0.   2.5  5.   7.5 10. ]
Arange Array: [0 2 4 6 8]
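The difference in how the two functions treat the end of the interval can be checked directly. The sketch below reuses the calls from the cells above and adds `endpoint=False`, a `linspace` parameter not used elsewhere in this notebook; the variable name `h` is illustrative.

```python
import numpy as np

# arange: half-open interval, the stop value 50 is excluded
f = np.arange(10, 50, 5)
print(f)            # [10 15 20 25 30 35 40 45]

# linspace: the stop value is included by default
g = np.linspace(0.0, 1.0, num=5)
print(g[-1])        # 1.0

# endpoint=False makes linspace half-open as well
h = np.linspace(0.0, 1.0, num=5, endpoint=False)
print(h)
```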


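**Important:** Sometimes, we may want to construct an array from existing arrays by “stacking” them, either vertically or horizontally. We can use `vstack()` and `hstack()`, respectively. (`row_stack` is an older alias of `vstack`; `column_stack` behaves like `hstack` for two-dimensional inputs but turns one-dimensional inputs into columns.)
Note: the naming can feel reversed at first. `vstack` stacks the inputs on top of each other, so each one-dimensional input becomes a horizontal row of the result; `hstack` places the inputs side by side.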

In [19]:
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
np.vstack((a,b))

array([[1, 2, 3],
       [4, 5, 6]])

In [20]:
a = np.array([[7], [8], [9]])
b = np.array([[4], [5], [6]])
np.hstack((a,b))

array([[7, 4],
       [8, 5],
       [9, 6]])
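For one-dimensional inputs the stacking functions behave differently from one another, which is easy to miss. The following sketch compares the resulting shapes; the variable names `h`, `c`, and `v` are illustrative.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

h = np.hstack((a, b))         # 1-D inputs are joined end to end
c = np.column_stack((a, b))   # 1-D inputs become the columns of a 2-D array
v = np.vstack((a, b))         # 1-D inputs become the rows of a 2-D array

print(h.shape)  # (6,)
print(c.shape)  # (3, 2)
print(v.shape)  # (2, 3)
```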

## NumPy Array Attributes

In [22]:
x1=np.array([5,6,7,8,9])
print(x1.shape)
print(x1)

(5,)
[5 6 7 8 9]


**Important:** Each array has attributes ``ndim`` (the number of dimensions), ``shape`` (the size of each dimension), and ``size`` (the total number of elements in the array):

In [63]:
print("x1 ndim: ", x1.ndim)
print("x1 shape:", x1.shape)
x2 = np.random.random((3, 4))  # Two-dimensional array
print("x2 size: ", x2.size)

x1 ndim:  1
x1 shape: (5,)
x2 size:  12


#### Datatypes

Every numpy array is a grid of elements of the same type. Numpy provides a large set of numeric datatypes that you can use to construct arrays. Numpy tries to guess a datatype when you create an array, but functions that construct arrays usually also include an optional argument to explicitly specify the datatype. Here is an example:

In [59]:
x = np.array([1, 2])  # Let numpy choose the datatype
y = np.array([1.0, 2.0])  # Let numpy choose the datatype
z = np.array([1, 2], dtype=np.int64)  # Force a particular datatype

print(x.dtype, y.dtype, z.dtype)

int64 float64 int64


You can read all about numpy datatypes in the [documentation](http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html).

## Array Indexing: Accessing Single Elements

If you are familiar with Python's standard list indexing, indexing in NumPy will feel quite familiar.
In a one-dimensional array, the $i^{th}$ value (counting from zero) can be accessed by specifying the desired index in square brackets, just as with Python lists:

In [28]:
x1

array([5, 6, 7, 8, 9])

In [29]:
x1[0]

5

In [30]:
x1[4]

9

To index from the end of the array, you can use negative indices:

In [31]:
x1[-1]

9

In [32]:
x1[-2]

8

In a multi-dimensional array, items can be accessed using a comma-separated tuple of indices:

In [64]:
x2

array([[0.21346156, 0.03607316, 0.0790533 , 0.06448956],
       [0.32257563, 0.74177212, 0.8598433 , 0.24521068],
       [0.80397453, 0.79898494, 0.17202231, 0.24085685]])

In [65]:
x2[0, 0]

0.21346156374833292

In [66]:
x2[2, 0]

0.8039745345120638

In [67]:
x2[2, -1]

0.2408568507908938

Values can also be modified using any of the above index notation:

In [68]:
x2[0, 0] = 12
x2

array([[12.        ,  0.03607316,  0.0790533 ,  0.06448956],
       [ 0.32257563,  0.74177212,  0.8598433 ,  0.24521068],
       [ 0.80397453,  0.79898494,  0.17202231,  0.24085685]])

**Important:** Keep in mind that, unlike Python lists, NumPy arrays have a fixed type.
This means, for example, that if you attempt to insert a floating-point value into an integer array, the value will be silently truncated. Don't be caught unaware by this behavior!

In [38]:
print(x1)
x1[0] = 3.14159  # this will be truncated!
x1

[5 6 7 8 9]


array([3, 6, 7, 8, 9])
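If the fractional part matters, one option is to convert the array to a floating-point dtype before assigning. This is a minimal sketch; the names `ints` and `floats` are illustrative.

```python
import numpy as np

ints = np.array([5, 6, 7, 8, 9])
ints[0] = 3.14159            # stored as 3: the array dtype stays integer
print(ints)                  # [3 6 7 8 9]

floats = ints.astype(float)  # astype returns a new array with the requested dtype
floats[0] = 3.14159          # the fractional part is kept now
print(floats[0])             # 3.14159
```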

## Array Slicing: Accessing Subarrays

Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the *slice* notation, marked by the colon (``:``) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array ``x``, use this:
``` python
x[start:stop:step]
```
If any of these are unspecified, they default to the values ``start=0``, ``stop=``*``size of dimension``*, ``step=1``.
We'll take a look at accessing sub-arrays in one dimension and in multiple dimensions.

### One-dimensional subarrays

In [25]:
x = np.arange(20)
x

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [26]:
x[:5]  # first five elements

array([0, 1, 2, 3, 4])

In [27]:
x[5:]  # elements from index 5 onward

array([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

In [28]:
x[4:7]  # middle sub-array

array([4, 5, 6])

In [29]:
x[::2]  # every other element, starting at index 0; for the odd numbers, start at index 1: x[1::2]

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])
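To pick out the odd numbers instead, the slice can start at index 1. The sketch below shows that variant next to a value-based alternative; `odds` and `odds_mask` are illustrative names.

```python
import numpy as np

x = np.arange(20)

odds = x[1::2]              # start at index 1, then take every second element
print(odds)                 # [ 1  3  5  7  9 11 13 15 17 19]

odds_mask = x[x % 2 == 1]   # alternative: select by value with a boolean mask
```

The slice depends on positions and the mask depends on values; they agree here only because `x` holds consecutive integers starting at 0.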

A potentially confusing case is when the ``step`` value is negative.
In this case, the defaults for ``start`` and ``stop`` are swapped.
This becomes a convenient way to reverse an array:

In [46]:
x[::-1]  # all elements, reversed

array([19, 18, 17, 16, 15, 14, 13, 12, 11, 10,  9,  8,  7,  6,  5,  4,  3,
        2,  1,  0])

### Multi-dimensional subarrays

Multi-dimensional slices work in the same way, with multiple slices separated by commas.
For example:

In [36]:
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x2


array([[3, 3, 5, 2],
       [3, 9, 7, 2],
       [3, 4, 3, 8]])

In [37]:
x2[:2, :3]  # two rows, three columns

array([[3, 3, 5],
       [3, 9, 7]])

In [38]:
x2[:3, ::2]  # all rows, every other column

array([[3, 5],
       [3, 7],
       [3, 3]])

Finally, subarray dimensions can even be reversed together:

In [39]:
x2[::-1, ::-1]

array([[8, 3, 4, 3],
       [2, 7, 9, 3],
       [2, 5, 3, 3]])

#### Accessing array rows and columns

One commonly needed routine is accessing of single rows or columns of an array.
This can be done by combining indexing and slicing, using an empty slice marked by a single colon (``:``):

In [40]:
print(x2[:, 0])  # first column of x2

[3 3 3]


In [41]:
print(x2[0, :])  # first row of x2

[3 3 5 2]


In the case of row access, the empty slice can be omitted for a more compact syntax:

In [42]:
print(x2[0])  # equivalent to x2[0, :]

[3 3 5 2]


### Subarrays as no-copy views

One important–and extremely useful–thing to know about array slices is that they return *views* rather than *copies* of the array data.
This is one area in which NumPy array slicing differs from Python list slicing: in lists, slices will be copies.
Consider our two-dimensional array from before:

In [43]:
print(x2)

[[3 3 5 2]
 [3 9 7 2]
 [3 4 3 8]]


Let's extract a $2 \times 2$ subarray from this:

In [44]:
x2_sub = x2[:2, :2]
print(x2_sub)

[[3 3]
 [3 9]]


Now if we modify this subarray, we'll see that the original array is changed! Observe:

In [7]:
x2_sub[0, 0] = 99
print(x2_sub)

[[99  3]
 [ 3  9]]


In [8]:
print(x2)

[[99  3  5  2]
 [ 3  9  7  2]
 [ 3  4  3  8]]


**Important:** This default behavior is actually quite useful: it means that when we work with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer.
- Changing part of the array through a view also changes the original array.

### Creating copies of arrays

Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray. This can be most easily done with the ``copy()`` method:

In [9]:
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)

[[99  3]
 [ 3  9]]


If we now modify this subarray, the original array is not touched:

In [10]:
x2_sub_copy[0, 0] = 42
print(x2_sub_copy)

[[42  3]
 [ 3  9]]


In [11]:
print(x2)

[[99  3  5  2]
 [ 3  9  7  2]
 [ 3  4  3  8]]
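Whether two arrays refer to the same underlying data can also be checked without modifying anything. The sketch below uses `np.shares_memory` for this; the array contents and the names `view` and `independent` are illustrative.

```python
import numpy as np

x2 = np.arange(12).reshape(3, 4)

view = x2[:2, :2]                  # basic slicing returns a view
independent = x2[:2, :2].copy()    # .copy() allocates new memory

print(np.shares_memory(x2, view))          # True
print(np.shares_memory(x2, independent))   # False
```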


## Reshaping of Arrays

Another useful type of operation is reshaping of arrays.
The most flexible way of doing this is with the ``reshape`` method.
For example, if you want to put the numbers 1 through 9 in a $3 \times 3$ grid, you can do the following:

In [14]:
grid = np.arange(1, 10).reshape((3,3))
print(grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [15]:
grid = np.arange(1,10)
print(grid)
grid = grid.reshape((3,3))
print(grid)

[1 2 3 4 5 6 7 8 9]
[[1 2 3]
 [4 5 6]
 [7 8 9]]


Note that for this to work, the size of the initial array must match the size of the reshaped array.

**Important:** Another common reshaping pattern is the conversion of a one-dimensional array into a two-dimensional row or column matrix.
This can be done with the ``reshape`` method, or more easily done by making use of the ``newaxis`` keyword within a slice operation:

In [45]:
x = np.array([1, 2, 3])
# column vector via reshape: shape (3,) becomes (3, 1)
print(x.shape)
print(x)
x = x.reshape((3,1))
print(x.shape)
print(x)

(3,)
[1 2 3]
(3, 1)
[[1]
 [2]
 [3]]


In [46]:
x = np.array([1, 2, 3])
# row vector via newaxis
x = x[np.newaxis, :]   # Adds an axis of length 1 at the front, increasing the array's dimensionality
print(x)
print(x.shape)

[[1 2 3]]
(1, 3)


In [47]:
x = np.array([1, 2, 3])
# column vector via reshape
x.reshape((3, 1))

array([[1],
       [2],
       [3]])

In [48]:
# column vector via newaxis
x = np.array([1, 2, 3])
x=x[:, np.newaxis]
print(x.shape)

(3, 1)


`np.newaxis`: This is used as an index to add an axis to an array. `np.newaxis` is an alias for `None`, and the two can be used interchangeably in this context. When you index an array with `np.newaxis`, you are asking NumPy to increase the dimension of the array by one at that index position.

`x[np.newaxis, :]`: This specific use adds a new axis at the first dimension, turning a one-dimensional array into a two-dimensional array where the first dimension is 1, and the second dimension is the length of the original array.

For example, if `x` is a one-dimensional array with shape `(n,)`, where `n` is the number of elements in `x`, then after applying `x = x[np.newaxis, :]`, the shape of `x` becomes `(1, n)`. This effectively makes `x` a row vector.
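The statements above can be verified in a few lines. This is a small sketch that only prints shapes.

```python
import numpy as np

x = np.array([1, 2, 3])

print(np.newaxis is None)        # True
print(x[np.newaxis, :].shape)    # (1, 3)  row vector
print(x[:, np.newaxis].shape)    # (3, 1)  column vector
print(x[None, :].shape)          # (1, 3)  None works the same way
```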

You will come across such transformations throughout any data science problem.


## Array Concatenation and Splitting

All of the preceding routines worked on single arrays. It's also possible to combine multiple arrays into one, and to conversely split a single array into multiple arrays. We'll take a look at those operations here.

### Concatenation of arrays

Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routines ``np.concatenate``, ``np.vstack``, and ``np.hstack``.
``np.concatenate`` takes a tuple or list of arrays as its first argument, as we can see here:

In [None]:
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])

array([1, 2, 3, 3, 2, 1])

You can also concatenate more than two arrays at once:

In [None]:
z = [99, 99, 99]
print(np.concatenate([x, y, z]))

[ 1  2  3  3  2  1 99 99 99]


It can also be used for two-dimensional arrays:

In [None]:
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.shape)

(2, 3)


In [None]:
# concatenate along the first axis
new_grid = np.concatenate([grid, grid], axis=0)
new_grid.shape

(4, 3)

In [None]:
# concatenate along the second axis (zero-indexed)
new_grid_col = np.concatenate([grid, grid], axis=1)
new_grid_col.shape

(2, 6)
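The opposite of concatenation is splitting, which the cells in this section do not demonstrate. As a minimal sketch, assuming the standard `np.split`, `np.vsplit`, and `np.hsplit` functions, an array can be cut at given indices; all variable names below are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 99, 99, 3, 2, 1])
part1, part2, part3 = np.split(x, [3, 5])   # cut before index 3 and before index 5
print(part1, part2, part3)                  # [1 2 3] [99 99] [3 2 1]

grid = np.arange(16).reshape((4, 4))
upper, lower = np.vsplit(grid, [2])   # first two rows, remaining rows
left, right = np.hsplit(grid, [2])    # first two columns, remaining columns
```

Passing N split points produces N + 1 subarrays.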

Numpy also provides many useful functions for performing computations on arrays, such as `min()`, `max()`, `sum()`, and others:

![](http://jalammar.github.io/images/numpy/numpy-matrix-aggregation-1.png)

In [35]:
data = np.array([[1, 2], [3, 4], [5, 6]])

print(np.max(data))  # Compute max of all elements; prints "6"
print(np.min(data))  # Compute min of all elements; prints "1"
print(np.sum(data))  # Compute sum of all elements; prints "21"

6
1
21


Not only can we aggregate all the values in a matrix using these functions, but we can also aggregate across the rows or columns by using the `axis` parameter:

![](http://jalammar.github.io/images/numpy/numpy-matrix-aggregation-4.png)

In [36]:
data = np.array([[1, 2], [5, 3], [4, 6]])

print(np.max(data, axis=0))  # Compute max of each column; prints "[5 6]"
print(np.max(data, axis=1))  # Compute max of each row; prints "[2 5 6]"

[5 6]
[2 5 6]


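- `axis=0`: the operation runs across the rows (vertically), producing one result per column.
- `axis=1`: the operation runs across the columns (horizontally), producing one result per row.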
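One way to remember the `axis` argument is that it names the dimension that gets collapsed. The sketch below shows this with `sum` on the same data as above; `col_sums` and `row_sums` are illustrative names.

```python
import numpy as np

data = np.array([[1, 2], [5, 3], [4, 6]])   # shape (3, 2)

col_sums = data.sum(axis=0)   # collapses the rows: one value per column
row_sums = data.sum(axis=1)   # collapses the columns: one value per row

print(col_sums, col_sums.shape)   # [10 11] (2,)
print(row_sums, row_sums.shape)   # [ 3  8 10] (3,)
```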

Let's look at a practical example using the numpy attributes we just discussed 💻

## Practice Questions for numpy
1. Define two custom numpy arrays, say A and B. Generate two new numpy arrays by stacking A and B vertically and horizontally.
2. Find common elements between A and B. [Hint : Intersection of two sets]
3. Extract all numbers from A which are within a specific range. eg between 5 and 10. [Hint: np.where() might be useful or boolean masks]
4. Filter the rows of iris_2d that have petallength (3rd column) > 1.5 and sepallength (1st column) < 5.0
```
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])
```

In [60]:
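A = np.array([1, 2, 3, 4])
B = np.array([5, 6, 7, 8])   # B must be defined before it is used
print(B.shape)
np.concatenate([A, B])       # one-dimensional arrays are joined end to end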

(4,)


array([1, 2, 3, 4, 5, 6, 7, 8])

In [63]:
# Vertically: axis=0 runs along the rows (down each column), which gives vertical stacking
A2=A[np.newaxis,:]
B2=B[np.newaxis,:]
z2=np.concatenate([A2,B2],axis=0)
print(A2.shape)
z2

(1, 4)


array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In [64]:
#Horizontally
z=np.concatenate([A2,B2],axis=1)
z

array([[1, 2, 3, 4, 5, 6, 7, 8]])

One-dimensional arrays cannot be concatenated directly along `axis=1`:

Calling `np.concatenate([A, B], axis=1)` on the original arrays raises an error because `A` and `B` are one-dimensional. A one-dimensional array only has `axis=0`; `axis=1` does not exist.

- You can turn the one-dimensional arrays into two-dimensional row vectors before concatenation. Use `np.newaxis` to add the extra dimension.
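The failure and the workaround can be reproduced side by side. This is a minimal sketch; the variable `joined` is illustrative, and the exact wording of the error message depends on the NumPy version.

```python
import numpy as np

A = np.array([1, 2, 3, 4])
B = np.array([5, 6, 7, 8])

try:
    np.concatenate([A, B], axis=1)   # 1-D arrays only have axis 0
except ValueError as err:            # NumPy's AxisError subclasses ValueError
    print("Failed:", err)

# Adding a leading axis gives two (1, 4) row vectors that can be joined on axis 1
joined = np.concatenate([A[np.newaxis, :], B[np.newaxis, :]], axis=1)
print(joined.shape)   # (1, 8)
```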

In [66]:
#2
common_ele=np.intersect1d(A,B)
common_ele

array([], dtype=int64)

In [68]:
#3
lower_end=5
upper_end=10
condition=(A>=lower_end) & (A<=upper_end)
indices=np.where(condition)
withinrange=A[indices]
withinrange

array([], dtype=int64)

In [69]:
condition2=(B>=lower_end) & (B<=upper_end)
indices2=np.where(condition2)
withinrangeb=B[indices2]
withinrangeb

array([5, 6, 7, 8])
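The hint in question 3 also mentions boolean masks. As a small sketch, the condition array can index the data directly, without the intermediate `np.where` call; the sample values and the names `values`, `mask`, and `in_range` are illustrative.

```python
import numpy as np

values = np.array([3, 5, 8, 12])

mask = (values >= 5) & (values <= 10)   # elementwise comparison gives a boolean array
in_range = values[mask]                 # keeps the elements where mask is True

print(in_range)   # [5 8]
```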

In [70]:
#4 Filter the rows of iris_2d that have petallength (3rd column) > 1.5 and sepallength (1st column) < 5.0
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris_2d = np.genfromtxt(url, delimiter=',', dtype='float', usecols=[0,1,2,3])

In [71]:
condition3=(iris_2d[:,2] >1.5) & (iris_2d[:,0]<5.0)
filter_rows=iris_2d[condition3]
filter_rows

array([[4.8, 3.4, 1.6, 0.2],
       [4.8, 3.4, 1.9, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [4.9, 2.4, 3.3, 1. ],
       [4.9, 2.5, 4.5, 1.7]])

In [None]:
## Optional Practice Question

#Find the mean of a numeric column grouped by a categorical column in a 2D numpy array

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = np.genfromtxt(url, delimiter=',', dtype='object')
names = ('sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'species')


numeric_column = iris[:, 1].astype('float')  # sepalwidth
grouping_column = iris[:, 4]  # species

output = []
"""Your code goes here"""

output

## Starting with Pandas

Pandas is a powerful and popular open-source Python library used for data manipulation (cleaning, filtering, sorting, reshaping, restructuring, aggregating, joining) and analysis. It provides data structures and functions designed to make working with structured (tabular) data easy and intuitive.

Check out the [documentation](https://pandas.pydata.org/docs/reference/index.html) as you code.

In [3]:
#we start from the very basics...import!

import pandas as pd
import numpy as np

The **DataFrame** is a two-dimensional labeled data structure with columns of potentially different types, similar to a spreadsheet or SQL table. It provides powerful indexing, slicing, and reshaping capabilities, making it easy to manipulate and analyze data.

## PART 1: Getting and Knowing your Data


In [72]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'

chipo = pd.read_csv(url, sep='\t')

### See the first 10 entries

In [66]:
chipo.head(10)  #Returns the first 10 rows

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
6,3,1,Side of Chips,,$1.69
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",$9.25
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",$9.25


### Print the last elements of the data set

In [73]:
chipo.tail()   #Returns the last 5 rows

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
4617,1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Sour ...",$11.75
4618,1833,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...",$11.75
4619,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...",$11.25
4620,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Lettu...",$8.75
4621,1834,1,Chicken Salad Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...",$8.75


### What is the number of observations in the dataset?

In [5]:
chipo.shape    # Returns a (rows, columns) tuple; chipo.shape[0] is the number of rows (observations)

(4622, 5)

### Another way

In [69]:
chipo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   order_id            4622 non-null   int64 
 1   quantity            4622 non-null   int64 
 2   item_name           4622 non-null   object
 3   choice_description  3376 non-null   object
 4   item_price          4622 non-null   object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB


### What is the number of columns in the dataset?

In [70]:
chipo.shape[1]

5

### What are the different columns in our dataset?

In [44]:
chipo.columns

Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')

In [71]:
chipo.index

RangeIndex(start=0, stop=4622, step=1)

### How many items were ordered in total?

In [72]:
total_items_orders = chipo.quantity.sum()
total_items_orders

4972

### Check the item price type

In [73]:
chipo.item_price.dtype
# It is a python object

dtype('O')

### How much was the revenue for the period in the dataset?

In [79]:
# Process the item_price column of the chipo DataFrame.
# The column is assumed to hold strings, each prefixed with a currency symbol (e.g. $).
# .str[1:] drops the first character of each string (the currency symbol), leaving only the numeric part.
chipo['item_price'] = chipo['item_price'].str[1:]

chipo['item_price'] = pd.to_numeric(chipo['item_price'])

In [80]:
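# Note: the rows shown earlier suggest item_price may already cover the whole line
# (e.g. 2 x Chicken Bowl at $16.98), in which case multiplying by quantity double-counts.
revenue = (chipo['quantity'] * chipo['item_price'])
revenue = revenue.sum()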

print('Revenue was: $' + str(np.round(revenue,2)))

Revenue was: $195387.88


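**Note:** `np.round(revenue, 2)` rounds the value to two decimal places, and `str()` converts the number to a string so it can be concatenated.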

### How many orders were made in the period?

In [86]:
orders = chipo.order_id.value_counts().count()
# Note: value_counts() returns the frequency of each unique value in the 'order_id' column; count() then counts those unique values
orders

1834

### How many different items are sold?


In [None]:
chipo.item_name.value_counts().count()

50
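Counting distinct values is common enough that pandas has a dedicated method for it, `nunique()`. The sketch below compares both approaches on a small made-up table; `orders_df` and its contents are hypothetical and not taken from the Chipotle data.

```python
import pandas as pd

# Hypothetical miniature of the order data
orders_df = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3, 3],
    "item_name": ["Izze", "Chips", "Chicken Bowl", "Izze", "Chips", "Chips"],
})

print(orders_df["order_id"].value_counts().count())   # 3
print(orders_df["order_id"].nunique())                # 3
print(orders_df["item_name"].nunique())               # 3
```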

## PART 2: Filtering and Sorting Data

### What is the price of each item? (Shown here for single-quantity Chicken Bowl orders)


In [None]:
chipo[(chipo['item_name'] == 'Chicken Bowl') & (chipo['quantity'] == 1)]

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
13,7,1,Chicken Bowl,"[Fresh Tomato Salsa, [Fajita Vegetables, Rice,...",$11.25
19,10,1,Chicken Bowl,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$8.75
26,13,1,Chicken Bowl,"[Roasted Chili Corn Salsa (Medium), [Pinto Bea...",$8.49
42,20,1,Chicken Bowl,"[Roasted Chili Corn Salsa, [Rice, Black Beans,...",$11.25
...,...,...,...,...,...
4590,1825,1,Chicken Bowl,"[Roasted Chili Corn Salsa, [Rice, Black Beans,...",$11.25
4591,1825,1,Chicken Bowl,"[Tomatillo Red Chili Salsa, [Rice, Black Beans...",$8.75
4595,1826,1,Chicken Bowl,"[Tomatillo Green Chili Salsa, [Rice, Black Bea...",$8.75
4599,1827,1,Chicken Bowl,"[Roasted Chili Corn Salsa, [Cheese, Lettuce]]",$8.75


### Sort by the name of the item

In [None]:
chipo.item_name.sort_values()

3389    6 Pack Soft Drink
341     6 Pack Soft Drink
1849    6 Pack Soft Drink
1860    6 Pack Soft Drink
2713    6 Pack Soft Drink
              ...        
2384    Veggie Soft Tacos
781     Veggie Soft Tacos
2851    Veggie Soft Tacos
1699    Veggie Soft Tacos
1395    Veggie Soft Tacos
Name: item_name, Length: 4622, dtype: object

### OR

In [None]:
chipo.sort_values(by = "item_name")

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
3389,1360,2,6 Pack Soft Drink,[Diet Coke],$12.98
341,148,1,6 Pack Soft Drink,[Diet Coke],$6.49
1849,749,1,6 Pack Soft Drink,[Coke],$6.49
1860,754,1,6 Pack Soft Drink,[Diet Coke],$6.49
2713,1076,1,6 Pack Soft Drink,[Coke],$6.49
...,...,...,...,...,...
2384,948,1,Veggie Soft Tacos,"[Roasted Chili Corn Salsa, [Fajita Vegetables,...",$8.75
781,322,1,Veggie Soft Tacos,"[Fresh Tomato Salsa, [Black Beans, Cheese, Sou...",$8.75
2851,1132,1,Veggie Soft Tacos,"[Roasted Chili Corn Salsa (Medium), [Black Bea...",$8.49
1699,688,1,Veggie Soft Tacos,"[Fresh Tomato Salsa, [Fajita Vegetables, Rice,...",$11.25


### What was the quantity of the most expensive item ordered? (Note the usage)

In [None]:
chipo.sort_values(by = "item_price", ascending = False).head(1)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
2624,1042,1,Steak Salad Bowl,"[Fresh Tomato Salsa, [Black Beans, Sour Cream,...",$9.39
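The output above still shows `item_price` with a `$` prefix, which suggests the sort ran on the string column. Strings are compared character by character, so `$9.39` ranks above any price starting with a smaller digit, and the row returned is not necessarily the most expensive one. The sketch below illustrates the difference on made-up values; the `prices` frame and the `price_num` column are hypothetical.

```python
import pandas as pd

# Hypothetical prices, stored as text the way the raw file stores them
prices = pd.DataFrame({"item_price": ["$9.39", "$44.25", "$11.75"]})

# Sorting strings compares character by character, so "$9..." beats "$4..."
top_as_text = prices.sort_values(by="item_price", ascending=False)["item_price"].iloc[0]
print(top_as_text)      # $9.39

# Strip the symbol and convert before sorting to compare actual amounts
prices["price_num"] = pd.to_numeric(prices["item_price"].str[1:])
top_as_number = prices.sort_values(by="price_num", ascending=False)["price_num"].iloc[0]
print(top_as_number)    # 44.25
```

Running the sort after the `pd.to_numeric` conversion from Part 1 should therefore give a numeric ordering.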


### How many times was a Veggie Salad Bowl ordered?

In [None]:
chipo_salad = chipo[chipo.item_name == "Veggie Salad Bowl"]
len(chipo_salad)

18

### Trying a different dataset

In [90]:
drinks = pd.read_csv('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv')
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


**Note:** The examples below combine `groupby` with aggregation functions.

### Which continent drinks more beer on average?

In [91]:
drinks.groupby('continent').beer_servings.mean()

continent
AF     61.471698
AS     37.045455
EU    193.777778
OC     89.687500
SA    175.083333
Name: beer_servings, dtype: float64
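The result above lists five continents and has no entry for North America. A likely cause, assuming the file codes North America as `NA`, is that `read_csv` treats the string `NA` as a missing value by default, and `groupby` drops missing keys. The sketch below reproduces this on a hypothetical three-row sample; `csv_text`, `default`, `kept`, and `means` are illustrative names.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same layout as drinks.csv
csv_text = "country,beer_servings,continent\nCanada,240,NA\nFrance,127,EU\nUSA,249,NA\n"

default = pd.read_csv(io.StringIO(csv_text))
print(default["continent"].isna().sum())   # 2: "NA" was read as missing

# keep_default_na=False keeps "NA" as an ordinary string
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
means = kept.groupby("continent")["beer_servings"].mean()
print(means)
```

With `keep_default_na=False`, the `NA` group appears in the grouped result alongside `EU`.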

### For each continent print the statistics for wine consumption.

In [92]:
drinks.groupby('continent').wine_servings.describe()

continent,count,mean,std,min,25%,50%,75%,max
AF,53.0,16.264151,38.846419,0.0,1.0,2.0,13.0,233.0
AS,44.0,9.068182,21.667034,0.0,0.0,1.0,8.0,123.0
EU,45.0,142.222222,97.421738,0.0,59.0,128.0,195.0,370.0
OC,16.0,35.625,64.55579,0.0,1.0,8.5,23.25,212.0
SA,12.0,62.416667,88.620189,1.0,3.0,12.0,98.5,221.0


### Print the mean alcohol consumption per continent for every column

In [100]:
print(drinks.dtypes)

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object


In [104]:
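#drinks['total_litres_of_pure_alcohol']=pd.to_numeric(drinks['total_litres_of_pure_alcohol'])
drinks.groupby('continent').mean()

TypeError: agg function failed [how->mean,dtype->object]

The call fails because the `country` column has dtype `object` and cannot be averaged. Passing `numeric_only=True`, as in `drinks.groupby('continent').mean(numeric_only=True)`, restricts the aggregation to the numeric columns.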

### Print the median alcohol consumption per continent for every column


In [105]:
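drinks.groupby('continent').median()

TypeError: agg function failed [how->median,dtype->object]

As with `mean()`, the non-numeric `country` column causes this error; `drinks.groupby('continent').median(numeric_only=True)` limits the aggregation to the numeric columns.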
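The two errors above come from aggregating a text column. As a minimal sketch on a hypothetical four-row table (`mini` is not part of the original data), either `numeric_only=True` or an explicit column selection avoids the problem.

```python
import pandas as pd

# Hypothetical miniature of the drinks table
mini = pd.DataFrame({
    "country": ["Albania", "Andorra", "Algeria", "Angola"],
    "beer_servings": [89, 245, 25, 217],
    "continent": ["EU", "EU", "AF", "AF"],
})

# numeric_only=True skips the text column instead of raising a TypeError
means = mini.groupby("continent").mean(numeric_only=True)
print(means)

# Alternative: select the numeric columns explicitly before aggregating
medians = mini.groupby("continent")[["beer_servings"]].median()
print(medians)
```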

### Print the mean, min and max values for spirit consumption.

In [106]:
drinks.groupby('continent').spirit_servings.agg(['mean', 'min', 'max'])

continent,mean,min,max
AF,16.339623,0,152
AS,60.840909,0,326
EU,132.555556,0,373
OC,58.4375,0,254
SA,114.75,25,302


### Trying some more functionality

In [107]:
csv_url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv'
df = pd.read_csv(csv_url)
stud_alcoh = df.loc[: , "school":"guardian"]
stud_alcoh.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother
4,GP,F,16,U,GT3,T,3,3,other,other,home,father


In [109]:
capitalizer = lambda q: q.capitalize()  
#A lambda function in Python is a small anonymous function that can have any number of arguments, but can only have one expression.
#They are defined using the lambda keyword, followed by a list of arguments, a colon, and then the expression to be evaluated.
# Lambda functions are often used when you need a simple function for a short period of time.

In [111]:
stud_alcoh['Mjob'].apply(capitalizer)
stud_alcoh['Fjob'].apply(capitalizer)
stud_alcoh.tail()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian
390,MS,M,20,U,LE3,A,2,2,services,services,course,other
391,MS,M,17,U,LE3,T,3,1,services,services,course,mother
392,MS,M,21,R,GT3,T,1,1,other,other,course,other
393,MS,M,18,R,LE3,T,3,2,services,other,course,mother
394,MS,M,19,U,LE3,T,1,1,other,at_home,course,father


In [112]:
stud_alcoh['Mjob'] = stud_alcoh['Mjob'].apply(capitalizer)
stud_alcoh['Fjob'] = stud_alcoh['Fjob'].apply(capitalizer)
stud_alcoh.tail()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian
390,MS,M,20,U,LE3,A,2,2,Services,Services,course,other
391,MS,M,17,U,LE3,T,3,1,Services,Services,course,mother
392,MS,M,21,R,GT3,T,1,1,Other,Other,course,other
393,MS,M,18,R,LE3,T,3,2,Services,Other,course,mother
394,MS,M,19,U,LE3,T,1,1,Other,At_home,course,father


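**Note:** explanation of the code above:
First, `stud_alcoh['Mjob'].apply(capitalizer)` applies the `capitalizer` function to every element of the 'Mjob' column. On its own, however, this line does not update the original DataFrame; it only returns the transformed column.
Then, `stud_alcoh['Mjob'] = stud_alcoh['Mjob'].apply(capitalizer)` assigns the transformed column back to the `stud_alcoh` DataFrame, which updates the 'Mjob' column.
The same process is applied to the 'Fjob' column.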
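The difference between calling `apply` and assigning its result can be seen on a tiny example. The `jobs` frame below is hypothetical; the sketch also shows `.str.capitalize()`, a built-in vectorized string method that does the same job as the lambda.

```python
import pandas as pd

jobs = pd.DataFrame({"Mjob": ["at_home", "services", "other"]})

jobs["Mjob"].apply(lambda q: q.capitalize())   # result discarded, jobs unchanged
print(jobs["Mjob"].tolist())                   # ['at_home', 'services', 'other']

jobs["Mjob"] = jobs["Mjob"].str.capitalize()   # vectorized equivalent, assigned back
print(jobs["Mjob"].tolist())                   # ['At_home', 'Services', 'Other']
```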

### Instead of just using existing data, we will now create our own DataFrame and Series

**Note:** **`pd.Series`** is a one-dimensional labeled array-like data structure in pandas. It can hold data of any type (integers, floats, strings, etc.) and is similar to a **one-dimensional NumPy array** or a Python list. However, unlike a NumPy array, a `pd.Series` can have *custom row labels*, which are referred to as the index.

In [113]:
a = pd.Series([1,2,3])
a

0    1
1    2
2    3
dtype: int64

You can specify custom index labels for a pandas Series by passing a list of index labels to the index parameter when creating the Series.

In [114]:
data = [10, 20, 30]
custom_index = ['A', 'B', 'C']
s = pd.Series(data, index=custom_index)
s

A    10
B    20
C    30
dtype: int64

In [115]:
print('Data passed as a list')
df_list = pd.DataFrame([['May1', 32], ['May2', 35], ['May3', 40], ['May4', 50]])
print(df_list)

Data passed as a list
      0   1
0  May1  32
1  May2  35
2  May3  40
3  May4  50


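`df_list` is created from a nested list. With this approach, each inner list represents one row of the DataFrame.
Because the DataFrame is built directly from lists, pandas does not assign descriptive column names unless they are given explicitly; it uses integer labels (0, 1, 2, and so on) as column names.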

In [116]:
print('Data passed as dictionary')
df_dict = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]},dtype = float)
print(df_dict)

Data passed as dictionary
     A    B
0  1.0  4.0
1  2.0  5.0
2  3.0  6.0


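`df_dict` is created from a dictionary. Here the dictionary keys become the column names of the DataFrame, and each value (a list) holds the data for the corresponding column.
When a DataFrame is created from a dictionary, the column order follows the order of the dictionary keys, which gives a natural way to specify column names.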

In [117]:
# Rename columns (important)

df_dict.rename(columns={'A': 'a'})

# inplace defaults to False.
# Without inplace=True (or assigning the result back), rename returns a new DataFrame
# and the original df_dict is left unchanged.
df_dict

Unnamed: 0,A,B
0,1.0,4.0
1,2.0,5.0
2,3.0,6.0


In [118]:
# Changes are applied to the original df (inplace=True matters here)

df_dict.rename(columns={'A': 'a'}, inplace=True)
df_dict


Unnamed: 0,a,B
0,1.0,4.0
1,2.0,5.0
2,3.0,6.0


In [119]:
# Reset column names
# Tip: remember to pass the entire list in this case

df_dict.columns = ['a', 'b']
df_dict.head()

Unnamed: 0,a,b
0,1.0,4.0
1,2.0,5.0
2,3.0,6.0


In [121]:
# Define columns and index during DataFrame creation (builds a custom DataFrame directly)

df_temp = pd.DataFrame([['October 1', 67], ['October 2', 72], ['October 3', 58], ['October 4', 69], ['October 5', 77]], index = ['Day 1', 'Day 2', 'Day 3', 'Day 4', 'Day 5'], columns = ['Month', 'Temperature'])
df_temp


Unnamed: 0,Month,Temperature
Day 1,October 1,67
Day 2,October 2,72
Day 3,October 3,58
Day 4,October 4,69
Day 5,October 5,77


## Practice Questions for Pandas

1. From df filter the 'Manufacturer', 'Model' and 'Type' for every 20th row starting from 1st (row 0).

```
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
```

2. Replace missing values in Min.Price and Max.Price columns with their respective mean.

```
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
```

3. How to get the rows of a dataframe with row sum > 100?

```
df = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))
```

In [6]:
#1
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
# Every 20th row starting at row 0, restricted to the three requested columns
selected_columns = df.loc[::20, ['Manufacturer', 'Model', 'Type']]
selected_columns

Unnamed: 0,Manufacturer,Model,Type
0,Acura,Integra,Small
20,Chrysler,LeBaron,Compact
40,Honda,Prelude,Sporty
60,Mercury,Cougar,Midsize
80,Subaru,Loyale,Small


In [10]:
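#2
min_price_mean = df["Min.Price"].mean()
max_price_mean = df["Max.Price"].mean()
# Assign the result back: fillna(..., inplace=True) on a selected column
# may not update df in recent pandas versions.
df["Min.Price"] = df["Min.Price"].fillna(min_price_mean)
df["Max.Price"] = df["Max.Price"].fillna(max_price_mean)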

In [13]:
#3
import numpy as np
df2 = pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))
row_sums=df2.sum(axis=1)
df2
filtered_rows=df2[row_sums>100]
filtered_rows

Unnamed: 0,0,1,2,3
0,22,29,19,38
2,36,26,37,12
4,25,33,29,38
6,31,24,24,35
9,37,25,39,10
10,36,35,37,20
12,23,35,20,27


The expression `pd.DataFrame(np.random.randint(10, 40, 60).reshape(-1, 4))` from question 3 produces a DataFrame with these properties:

- It contains random integers from 10 up to, but not including, 40.
- There are 60 values in total.
- The values are arranged in 4 columns and therefore 15 rows, because 60 / 4 = 15.

Step by step:

- `np.random.randint(10, 40, 60)`: generates 60 random integers in the half-open range [10, 40).
- `.reshape(-1, 4)`: reshapes the array into 4 columns; `-1` lets NumPy infer the number of rows.
- `pd.DataFrame(...)`: converts the array into a pandas DataFrame.
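The shape inference and the value range can be checked directly. This is a short sketch; `values` and `table` are illustrative names.

```python
import numpy as np

values = np.random.randint(10, 40, 60)   # 60 integers drawn from 10..39
table = values.reshape(-1, 4)            # -1 lets NumPy infer 15 rows

print(table.shape)                               # (15, 4)
print(values.min() >= 10, values.max() <= 39)    # True True
```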
