#### NUMPY PROJECT - WINE QUALITY

- In this project, we’ll walk through using NumPy to analyze data on wine quality. 
  <br>
- The data contains information on various attributes of wines, such as pH and fixed acidity,<br>
  along with a quality score between 0 and 10 for each wine. 
<br>

- The quality score is the average of at least 3 human taste testers. 

###### CSV File - winequality-red.csv

- The data is in what we are going to call ssv (semicolon separated values) format: <br>
  each field within a record is separated by a semicolon (;), and rows are separated by a new line. 
  <br>

- There are 1600 rows in the file, including a header row, and 12 columns.

###### Reading CSV data

1. We can use NumPy to directly read csv or other files into arrays. 
   <br>
2. We can do this using the 'numpy.genfromtxt' function.
   <br>
3. We can use it to read in our initial data on red wines. 
   <br>
4. In the below code, we:

    - Use the genfromtxt function to read in the winequality-red.csv file.
    - Specify the keyword argument delimiter=";" so that the fields are parsed properly.
    - Specify the keyword argument skip_header=1 so that the header row is skipped.

In [2]:
import numpy as np

In [11]:
wines=np.genfromtxt("winequality-red.csv",delimiter=";",skip_header=1)

print("Values in winequality file:-\n\n",wines)

# wines1=np.genfromtxt("winequality-red.csv",delimiter=";",dtype=str) # With Header values

# print("Values in winequality file:-\n\n",wines1)

Values in winequality file:-

 [[ 7.4    0.7    0.    ...  0.56   9.4    5.   ]
 [ 7.8    0.88   0.    ...  0.68   9.8    5.   ]
 [ 7.8    0.76   0.04  ...  0.65   9.8    5.   ]
 ...
 [ 6.3    0.51   0.13  ...  0.75  11.     6.   ]
 [ 5.9    0.645  0.12  ...  0.71  10.2    5.   ]
 [ 6.     0.31   0.47  ...  0.66  11.     6.   ]]
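To try the same call without the file on disk, here is a minimal sketch that reads a made-up, three-column sample from memory. The `sample` text and the `sample_wines` name are assumptions for illustration only; they are not taken from the real dataset.

```python
import io
import numpy as np

# Hypothetical sample: one header row plus two records, fields separated by ";".
sample = io.StringIO(
    '"fixed acidity";"pH";"quality"\n'
    "7.4;3.51;5\n"
    "7.8;3.20;5\n"
)

# Same keyword arguments as above: split on ";" and skip the header row.
sample_wines = np.genfromtxt(sample, delimiter=";", skip_header=1)

print(sample_wines.shape)  # (2, 3)
print(sample_wines.dtype)  # float64
```

Every field is parsed as a float by default, which is why the quality scores appear as `5.` rather than `5`.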


###### Indexing

- We can use array indexing to select individual elements, groups of elements, or entire rows & columns.
  <br>
- One important thing to keep in mind is that, just like Python lists, NumPy arrays are zero-indexed, <br>
  meaning that the index of the first row is '0', and the index of the first column is '0'.
  <br>
  
- Ex:- If we want to work with the fourth row, we would use the index 3.<br>
  If we want to work with the second row, we would use index 1, and so on. 

In [12]:
print("the value at row 3,column 4:-",wines[2,3])

print()

print("the value at row 0,column 0:-",wines[0,0])

the value at row 3,column 4:- 2.3

the value at row 0,column 0:- 7.4


- Let’s select the element at row 3 and column 4. In the above code,
  we pass in the index 2 as the row index, and the index 3 as the column index.
  This retrieves the value from the fourth column of the third row.

    **Note:-** 
    
   - Since we’re working with a 2-dimensional array in NumPy, we specify 2 indexes to retrieve an element.

   - The first index is the row index (axis 0 in NumPy), and the second index is the column index (axis 1). 
    
   - Any element in wines can be retrieved using 2 indexes.
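
As a small illustration of two-index access, the sketch below uses a made-up 2 x 3 array (`grid` is a hypothetical name). The `shape` attribute shows how many valid positions each index has.

```python
import numpy as np

# Hypothetical 2 x 3 array: 2 rows, 3 columns.
grid = np.array([[10, 11, 12],
                 [20, 21, 22]])

print(grid.ndim)    # 2
print(grid.shape)   # (2, 3)

# First index picks the row, second index picks the column.
print(grid[0, 0])   # 10
print(grid[1, 2])   # 22
```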

###### Slicing

- If we instead want to select the first three items from the fourth column,
  we can do it using a colon (:). 
  <br>
- A colon indicates that we want to select all the elements 
  from the starting index up to but not including the ending index. 
  <br>
  
- This is also known as a slice:

In [17]:
W = wines[0:3,3]

print("Retrieving first 3 items from fourth column:-\n",W)

print()

# Just like with list slicing, it’s possible to omit the 0.
# With a single slice, wines[:3] retrieves the first three rows (indices 0, 1 and 2):

print("Retrieving all the elements from the beginning up to element 3:-\n\n", wines[:3])

Retrieving first 3 items from fourth column:-
 [1.9 2.6 2.3]

Retrieving all the elements from the beginning up to element 3:-

 [[7.400e+00 7.000e-01 0.000e+00 1.900e+00 7.600e-02 1.100e+01 3.400e+01
  9.978e-01 3.510e+00 5.600e-01 9.400e+00 5.000e+00]
 [7.800e+00 8.800e-01 0.000e+00 2.600e+00 9.800e-02 2.500e+01 6.700e+01
  9.968e-01 3.200e+00 6.800e-01 9.800e+00 5.000e+00]
 [7.800e+00 7.600e-01 4.000e-02 2.300e+00 9.200e-02 1.500e+01 5.400e+01
  9.970e-01 3.260e+00 6.500e-01 9.800e+00 5.000e+00]]


In [20]:
# We can select an entire column by specifying that we want all the elements,
# from the first to the last. 
# We specify this by just using the colon (:), 
# with no starting or ending indices. 

# The below code will select the entire fourth column:

print("Retrieving the entire 4th column values:-", wines[:,3])

Retrieving the entire 4th column values:- [1.9 2.6 2.3 ... 2.3 2.  3.6]


In [38]:
# Extracting an entire row (index 3 is the fourth row):

print("Retrieving 4th row values:- \n",wines[3,:])

Retrieving 4th row values:- 
 [11.2    0.28   0.56   1.9    0.075 17.    60.     0.998  3.16   0.58
 75.     6.   ]


In [25]:
# Extracting both row and column values

print("row and column values:- \n",wines[2,3:9])

row and column values:- 
 [ 2.3    0.092 15.    54.     0.997  3.26 ]


In [27]:
# If we take our indexing to the extreme, 
# we can select the entire array using two colons 
# to select all the rows and columns in wines.

print("Extracting all rows and columns:-\n\n", wines[:,:])

Extracting all rows and columns:-

 [[ 7.4    0.7    0.    ...  0.56   9.4    5.   ]
 [ 7.8    0.88   0.    ...  0.68   9.8    5.   ]
 [ 7.8    0.76   0.04  ...  0.65   9.8    5.   ]
 ...
 [ 6.3    0.51   0.13  ...  0.75  11.     6.   ]
 [ 5.9    0.645  0.12  ...  0.71  10.2    5.   ]
 [ 6.     0.31   0.47  ...  0.66  11.     6.   ]]


###### Assigning values

In [29]:
# We can also use indexing to assign values to certain elements in arrays.

# We can do this by assigning directly to the indexed value. 

print("original value:-",wines[1,1],"\n")

wines[1,1]=200

print("changed value:-",wines[1,1])

original value:- 200.0 

changed value:- 200.0


In [31]:
# We can do the same for slices. To overwrite an entire column:-

print("original column values:-",wines[:,10],"\n")

wines[:,10]=75

print("changed column values:-",wines[:,10])

original column values:- [75. 75. 75. ... 75. 75. 75.] 

changed column values:- [75. 75. 75. ... 75. 75. 75.]


In [36]:
# To overwrite an entire row values:- 

print("original row values:- \n",wines[4,:],"\n")

wines[4,:]=90

print("changed row values:-",wines[4,:])

original row values:- 
 [75. 75. 75. 75. 75. 75. 75. 75. 75. 75. 75. 75.] 

changed row values:- [90. 90. 90. 90. 90. 90. 90. 90. 90. 90. 90. 90.]
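
Assignments like these change the array in place, so every later cell sees the overwritten values. A basic slice is also a view onto the same data, so writing through a name bound to a slice changes the original too. The sketch below uses a made-up array (`data`, `backup` and `col` are hypothetical names) and shows one way to keep the original values around with `copy()`.

```python
import numpy as np

data = np.array([[1.0, 2.0],
                 [3.0, 4.0]])

backup = data.copy()   # independent copy, unaffected by later assignments
col = data[:, 1]       # basic slice: a view onto the second column of data

col[:] = 75            # writes through the view into data

print(data.tolist())    # [[1.0, 75.0], [3.0, 75.0]]
print(backup.tolist())  # [[1.0, 2.0], [3.0, 4.0]]
```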


## 1-Dimensional NumPy Arrays

- So far, we’ve worked with 2-dimensional arrays, such as wines.
  <br>  
- However, NumPy is a package for working with multidimensional arrays.
  <br>
- One of the most common types of multidimensional arrays is the 1-dimensional array, or vector.
  <br>
- As we may have noticed above, when we sliced wines, we retrieved a 1-dimensional array.
  <br>
- A 1-dimensional array only needs a single index to retrieve an element.
  <br>
- Each row and column in a 2-dimensional array is a 1-dimensional array.
  <br> 
- Just like a list of lists is analogous to a 2-dimensional array, 
  a single list is analogous to a 1-dimensional array. 
  <br>  

- If we slice wines and only retrieve the row at index 3 (the fourth row), we get a 1-dimensional array:-

In [44]:
w1=wines[3,:]

print("Extracting 4th row values:- \n", w1)

print()

# We can retrieve individual elements from "w1" using a single index. 

# The below code will display the sixth item in "w1" (index 5):

print("Displaying sixth value from w1:-", w1[5])
print()

# The below code will display the ninth item in "w1" (index 8):

print("Displaying ninth value from w1:-", w1[8])

Extracting 4th row values:- 
 [11.2    0.28   0.56   1.9    0.075 17.    60.     0.998  3.16   0.58
 75.     6.   ]

Displaying sixth value from w1:- 17.0

Displaying ninth value from w1:- 3.16
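
The number of dimensions you get back depends on how you index. As a rough sketch on a made-up array (`table` is a hypothetical name): an integer index drops that dimension, while a slice keeps it.

```python
import numpy as np

table = np.arange(12).reshape(3, 4)   # hypothetical 3 x 4 array

row = table[1, :]      # integer row index: 1-dimensional result
column = table[:, 2]   # integer column index: 1-dimensional result
block = table[1:2, :]  # slice on both axes: still 2-dimensional

print(row.ndim, row.shape)        # 1 (4,)
print(column.ndim, column.shape)  # 1 (3,)
print(block.ndim, block.shape)    # 2 (1, 4)
```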


# Data types

In [45]:
print("Data type of the values in the file:-", wines.dtype)

Data type of the values in the file:- float64


In [49]:
print("Values in the file are appearing in float:-\n",wines)

Values in the file are appearing in float:-
 [[7.40e+00 7.00e-01 0.00e+00 ... 5.60e-01 7.50e+01 5.00e+00]
 [7.80e+00 2.00e+02 0.00e+00 ... 6.80e-01 7.50e+01 5.00e+00]
 [7.80e+00 7.60e-01 4.00e-02 ... 6.50e-01 7.50e+01 5.00e+00]
 ...
 [6.30e+00 5.10e-01 1.30e-01 ... 7.50e-01 7.50e+01 6.00e+00]
 [5.90e+00 6.45e-01 1.20e-01 ... 7.10e-01 7.50e+01 5.00e+00]
 [6.00e+00 3.10e-01 4.70e-01 ... 6.60e-01 7.50e+01 6.00e+00]]


###### Data type Conversion

- We can use the numpy.ndarray.astype method to convert an array to a different type. 
  <br>
  
- The method will actually copy the array, and return a new array with the specified data type.

In [53]:
print("Converting float to int data type:-\n\n", wines.astype(int))

Converting float to int data type:-

 [[  7   0   0 ...   0  75   5]
 [  7 200   0 ...   0  75   5]
 [  7   0   0 ...   0  75   5]
 ...
 [  6   0   0 ...   0  75   6]
 [  5   0   0 ...   0  75   5]
 [  6   0   0 ...   0  75   6]]
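
Two details of `astype` are worth seeing on a small example: converting floats to `int` drops the fractional part rather than rounding, and the original array keeps its type. The values below are made up for illustration.

```python
import numpy as np

values = np.array([7.4, 0.88, 9.8, -1.7])   # hypothetical measurements

as_int = values.astype(int)                 # new array, fraction discarded
rounded = np.round(values).astype(int)      # round first if that is what you want

print(as_int.tolist())    # [7, 0, 9, -1]
print(rounded.tolist())   # [7, 1, 10, -2]
print(values.dtype)       # float64 (the original is unchanged)
```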


###### NumPy Array Operations

- NumPy makes it simple to perform mathematical operations on arrays. 

- This is one of the primary advantages of NumPy, and makes it quite easy to do computations.

  **Single array math**

  - If you do any of the basic mathematical operations (/, *, -, +, **) with an array and a value, <br>
    it will apply the operation to each of the elements in the array.

In [61]:
# Here, we want to add 10 points to each quality score, because we’re drunk and feeling generous.
# Here’s how we’d do that:

print("Original values in quality score:-", wines[:,11])

print("\nAdding 10 points to quality score:-", wines[:,11]+10)

# wines[:,11] = wines[:,11]+10

# wines[:,11]

print("\nMultiplying each quality score by 3:-", wines[:,11]*3)

Original values in quality score:- [35. 35. 35. ... 36. 35. 36.]

Adding 10 points to quality score:- [45. 45. 45. ... 46. 45. 46.]

Multiplying each quality score by 3:- [105. 105. 105. ... 108. 105. 108.]
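
One point to watch when raising values to a power: in Python the exponent operator is `**`, while `^` is bitwise XOR and is not defined for float arrays. A short sketch with made-up scores:

```python
import numpy as np

scores = np.array([5.0, 6.0, 7.0])   # hypothetical quality scores

print((scores + 10).tolist())   # [15.0, 16.0, 17.0]
print((scores * 3).tolist())    # [15.0, 18.0, 21.0]
print((scores ** 2).tolist())   # [25.0, 36.0, 49.0]

# None of these expressions modifies scores; each returns a new array.
print(scores.tolist())          # [5.0, 6.0, 7.0]
```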


**Multi array math:-**

- It's also possible to do mathematical operations between arrays. 
  This will apply the operation to pairs of elements. 

  For example, if we add the quality column to itself, here's what we get:

In [65]:
print("Adding quality score column to itself:-", wines[:,11]+wines[:,11])

print()

# We can also use this to multiply arrays. 

# Here, we want to pick a wine that maximizes alcohol content and quality.
# We'd multiply alcohol by quality, and select the wine with the highest score:

print("Multiplying alcohol by quality score:-", wines[:,10]*wines[:,11])

Adding quality score column to itself:- [70. 70. 70. ... 72. 70. 72.]

Multiplying alcohol by quality score:- [2625. 2625. 2625. ... 2700. 2625. 2700.]
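
The cell above computes the combined score but does not pick the winner. One way to do that, sketched here on made-up values, is `numpy.argmax`, which returns the index of the largest element. The `alcohol`, `quality`, `score` and `best` names are assumptions for this example.

```python
import numpy as np

alcohol = np.array([9.4, 12.8, 11.0])   # hypothetical alcohol values
quality = np.array([5.0, 7.0, 6.0])     # hypothetical quality scores

score = alcohol * quality               # element-wise product, pair by pair
best = int(np.argmax(score))            # index of the highest combined score

print(best)   # 1
```

With the real data, the same index could then be used as `wines[best, :]` to look at the full row.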


###### NumPy Array Methods

- In addition to the common mathematical operations, <br>
  NumPy also has several methods that we can use for more complex calculations on arrays.

In [76]:
print("Sum of all the elements in quality column:-", wines[:,11].sum())

print("\nSum of all the elements in alcohol column:-", wines[:,10].sum())

print("\nSum of elements for each column:-\n", wines.sum(axis=0))

# We can verify that we did the sum correctly by checking the shape. 
# The shape should be (12,), corresponding to the number of columns:
    
print("\nShape of Sum of elements for each column:-", wines.sum(axis=0).shape)

print("\nSum of elements for each row:-\n", wines.sum(axis=1))

print("\nShape of Sum of elements for each row:-", wines.sum(axis=1).shape)

Sum of all the elements in quality column:- 57067.0

Sum of all the elements in alcohol column:- 119940.0

Sum of elements for each column:-
 [ 13385.7       1132.405      523.29      4147.65       229.783
  25463.       74358.        1682.80014   5380.96      1141.82
 119940.       57067.     ]

Shape of Sum of elements for each column:- (12,)

Sum of elements for each row:-
 [170.1438  417.3748  194.899   ... 194.48174 200.01547 186.49249]

Shape of Sum of elements for each row:- (1599,)


- There are several other methods that behave like the sum method, including:-

    1. **numpy.ndarray.mean** — finds the mean of an array.
     
    2. **numpy.ndarray.std** — finds the standard deviation of an array.<br>
    
    3. **numpy.ndarray.min** — finds the minimum value in an array.<br>
    
    4. **numpy.ndarray.max** — finds the maximum value in an array.

In [88]:
print("Mean of all the elements in the file:-", np.mean(wines))

print("\nMean of all the elements in quality column:-", np.mean(wines[:,11]))

print("\nStandard deviation of all the elements in the file:-", np.std(wines))

print("\nStandard deviation of all the elements in quality column:-", np.std(wines[:,11]))

print("\nMinimum value of all the elements in the file:-", wines.min())

print("\nMinimum value of all the elements in the quality column:-", wines[:,11].min())

print("\nMaximum value of all the elements in the file:-", wines.max())

print("\nMaximum value of all the elements in the quality column:-", wines[:,11].max())

Mean of all the elements in the file:- 15.86681301542631

Mean of all the elements in quality column:- 35.68918073796122

Standard deviation of all the elements in the file:- 25.256546304552028

Standard deviation of all the elements in quality column:- 2.258265732659416

Minimum value of all the elements in the file:- 0.0

Minimum value of all the elements in the quality column:- 33.0

Maximum value of all the elements in the file:- 289.0

Maximum value of all the elements in the quality column:- 120.0
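
These methods accept the same `axis` argument as `sum`. A minimal sketch on a made-up 2 x 3 array (`m` is a hypothetical name): `axis=0` gives one result per column, `axis=1` gives one result per row.

```python
import numpy as np

m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(m.mean())                  # 3.5 (all elements)
print(m.mean(axis=0).tolist())   # [2.5, 3.5, 4.5] (per column)
print(m.min(axis=0).tolist())    # [1.0, 2.0, 3.0] (per column)
print(m.max(axis=1).tolist())    # [3.0, 6.0] (per row)

row_std = m.std(axis=1)          # standard deviation of each row
print(row_std.shape)             # (2,)
```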


###### NumPy Array Comparisons

- NumPy makes it possible to test whether rows match certain values<br>
  using mathematical comparison operations like <, >, >=, <=, and ==. 
  <br>
  
- For example, if we want to see which wines have a quality rating higher than 5, we can do so with these operators. 

In [93]:
print("wines having quality higher than 5:-", wines[:,11]>5)

print("\nWines having quality score equal to 35:-", wines[:,11]==35)

wines having quality higher than 5:- [ True  True  True ...  True  True  True]

Wines having quality score equal to 35:- [ True  True  True ... False  True False]
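
Because the printed Boolean array is abbreviated, it is often more useful to summarize it. As a sketch on made-up scores: `True` counts as 1 when summed, so `sum()` gives the number of matches, and `any()` / `all()` answer whether some or every element matched.

```python
import numpy as np

quality = np.array([5, 6, 7, 5, 8])   # hypothetical quality scores

mask = quality > 5

print(mask.tolist())                        # [False, True, True, False, True]
print(int(mask.sum()))                      # 3
print(bool(mask.any()), bool(mask.all()))   # True False
```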


###### Subsetting

- One of the powerful things we can do with a Boolean array and <br>
  a NumPy array is select only certain rows or columns in the NumPy array.
  <br>

- For example, the below code builds Boolean arrays that mark the rows in wines where the quality is over 7,
  where it is under 7, and where the alcohol content is over 10:

In [33]:
wines=np.genfromtxt("winequality-red.csv",delimiter=";",skip_header=1)

wines

array([[ 7.4  ,  0.7  ,  0.   , ...,  0.56 ,  9.4  ,  5.   ],
       [ 7.8  ,  0.88 ,  0.   , ...,  0.68 ,  9.8  ,  5.   ],
       [ 7.8  ,  0.76 ,  0.04 , ...,  0.65 ,  9.8  ,  5.   ],
       ...,
       [ 6.3  ,  0.51 ,  0.13 , ...,  0.75 , 11.   ,  6.   ],
       [ 5.9  ,  0.645,  0.12 , ...,  0.71 , 10.2  ,  5.   ],
       [ 6.   ,  0.31 ,  0.47 , ...,  0.66 , 11.   ,  6.   ]])

In [98]:
hq=wines[:,11]>7
print("quality higher than 7:-",hq)

hq=wines[:,11]<7
print("\nquality less than 7:-",hq)

Alc=wines[:,10]>10

print("\nalcohol higher than 10:-",Alc)

quality higher than 7:- [ True  True  True ...  True  True  True]

quality less than 7:- [False False False ... False False False]

alcohol higher than 10:- [ True  True  True ...  True  True  True]


- Indexing wines with a Boolean array, as in wines[hq, :], selects only the rows where the array contains a True value, and all of the columns. 

- This subsetting makes it simple to filter arrays for certain criteria.

- For example, we can look for wines with a lot of alcohol and high quality. 

- In order to specify multiple conditions, we have to place each condition in parentheses
  and separate the conditions with an ampersand (&):

In [112]:
high_quality_and_alcohol = (wines[:,10] > 10) & (wines[:,11] > 7)

print("More alcohol and high quality:-\n\n", wines[high_quality_and_alcohol,10:])

print()

print("Alcohol and high quality:-", high_quality_and_alcohol)

More alcohol and high quality:-

 [[75. 35.]
 [75. 35.]
 [75. 35.]
 ...
 [75. 36.]
 [75. 35.]
 [75. 36.]]

Alcohol and high quality:- [ True  True  True ...  True  True  True]
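
The same pattern is easier to follow on a handful of rows. The sketch below uses a made-up two-column array (alcohol, quality); `sample`, `high_quality` and `both` are hypothetical names. The parentheses matter because `&` binds more tightly than `>` in Python.

```python
import numpy as np

# Hypothetical rows: column 0 is alcohol, column 1 is quality.
sample = np.array([[ 9.4, 5.0],
                   [12.8, 8.0],
                   [10.5, 8.0],
                   [ 9.8, 8.0]])

high_quality = sample[:, 1] > 7             # Boolean array, one entry per row
print(sample[high_quality, :].tolist())     # [[12.8, 8.0], [10.5, 8.0], [9.8, 8.0]]

both = (sample[:, 0] > 10) & (sample[:, 1] > 7)
print(sample[both, :].tolist())             # [[12.8, 8.0], [10.5, 8.0]]
```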


###### Reshaping NumPy arrays

- We can change the shape of arrays while still preserving all of their elements.
  <br>
- This often can make it easier to access array elements. 
  <br>
- The simplest reshaping is to flip the axes, so rows become columns, and vice versa. 
  <br>
 
- We can accomplish this with the numpy.transpose function:

In [115]:
print("Shape of the Wines:-",wines.shape)

tp = np.transpose(wines).shape

print("\nReshaping using transpose :-",tp)

Shape of the Wines:- (1599, 12)

Reshaping using transpose :- (12, 1599)
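
On a small made-up array it is easier to see where each element ends up. The sketch also uses the `.T` attribute, which gives the same result as `np.transpose` for a 2-dimensional array.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])   # hypothetical 2 x 3 array

t = np.transpose(a)

print(t.shape)                  # (3, 2)
print(t.tolist())               # [[1, 4], [2, 5], [3, 6]]
print(np.array_equal(t, a.T))   # True
```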


###### ravel

- We can use the numpy.ravel function to turn an array into a one-dimensional representation.
  <br>
  
- It will essentially flatten an array into a long sequence of values.

In [123]:
print("Row at index 10 before flattening:-\n",wines[10])

tp = wines.ravel()

print("\nFlattening using ravel :-",tp)

print("\nDisplaying the value at index 10 after flattening:-", tp[10])

Row at index 10 before flattening:-
 [ 6.7     0.58    0.08    1.8     0.097  15.     65.      0.9959  3.28
  0.54   75.     35.    ]

Flattening using ravel :- [ 7.4   0.7   0.   ...  0.66 75.   36.  ]

Displaying the value at index 10 after flattening:- 75.0


In [48]:
array_one = np.array(
    [
        [1, 2, 3, 4],
        [5, 6, 7, 8]
    ]
)
array_one.ravel()

array([1, 2, 3, 4, 5, 6, 7, 8])

###### reshape()

- We can use the numpy.reshape function to reshape an array to a certain shape we specify. 

In [130]:
# The below code will turn the second row of wines into a 2-dimensional array with 3 rows and 4 columns:

print("Second row of wines before reshaping:-\n\n", wines[1,:])

print()

print("Second row of wines after reshaping:-\n\n", wines[1,:].reshape((3,4)))

Second row of wines before reshaping:-

 [7.800e+00 2.000e+02 0.000e+00 2.600e+00 9.800e-02 2.500e+01 6.700e+01
 9.968e-01 3.200e+00 6.800e-01 7.500e+01 3.500e+01]

Second row of wines after reshaping:-

 [[7.800e+00 2.000e+02 0.000e+00 2.600e+00]
 [9.800e-02 2.500e+01 6.700e+01 9.968e-01]
 [3.200e+00 6.800e-01 7.500e+01 3.500e+01]]
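
A reshape only works when the new shape holds exactly the same number of elements. The sketch below, on a made-up 12-element array, also shows that one dimension can be given as `-1` so NumPy works it out, and what happens when the sizes do not match.

```python
import numpy as np

row = np.arange(12)   # hypothetical 12-element array: 0, 1, ..., 11

print(row.reshape((3, 4)).tolist())
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

# -1 asks NumPy to infer that dimension: 12 elements / 2 rows = 6 columns.
print(row.reshape((2, -1)).shape)   # (2, 6)

# 5 * 3 = 15 does not equal 12, so this raises a ValueError.
try:
    row.reshape((5, 3))
    reshape_failed = False
except ValueError:
    reshape_failed = True

print(reshape_failed)   # True
```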


###### Combining NumPy Arrays

- With NumPy, it’s very common to combine multiple arrays into a single unified array. 
  <br>
  
- **vstack :-** We can use numpy.vstack to vertically stack multiple arrays. 

In [133]:
# We can read in the winequality-white.csv dataset that contains information on the quality of white wines,
# then combine it with our existing dataset, wines, which contains information on red wines.

white_wines=np.genfromtxt("winequality-white.csv",delimiter=";",skip_header=1)

print("Shape of white wines file:-",white_wines.shape)

Shape of white wines file:- (4898, 12)


In [134]:
red_wines=np.genfromtxt("winequality-red.csv",delimiter=";",skip_header=1)

print("Shape of red wines file:-",red_wines.shape)

Shape of red wines file:- (1599, 12)


In [135]:
all_wines=np.vstack((red_wines,white_wines))

print("Shape of all wines:-", all_wines.shape)

Shape of all wines:- (6497, 12)


**hstack**

- If we want to combine arrays horizontally, where the number of rows stays constant<br>
  but the columns are joined, then we can use the numpy.hstack function. 
  <br>
  
- The arrays we combine need to have the same number of rows for this to work.

In [137]:
print("Shape of red wines file:-",red_wines.shape)
print()
print("Shape of white wines file:-",white_wines.shape)

# Raises a ValueError because the two arrays do not have the same number of rows.
all_wines=np.hstack((red_wines,white_wines))

all_wines.shape

Shape of red wines file:- (1599, 12)

Shape of white wines file:- (4898, 12)


ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1599 and the array at index 1 has size 4898
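
For contrast, here is a minimal sketch of `hstack` succeeding. The two made-up arrays (`left` and `right` are hypothetical names) have the same number of rows, so their columns can be placed side by side.

```python
import numpy as np

left = np.array([[1, 2],
                 [3, 4]])     # 2 rows, 2 columns
right = np.array([[5],
                  [6]])       # 2 rows, 1 column

combined = np.hstack((left, right))

print(combined.shape)      # (2, 3)
print(combined.tolist())   # [[1, 2, 5], [3, 4, 6]]
```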

###### Concatenation

- We can use numpy.concatenate as a general purpose version of hstack and vstack. 
  <br>
  
- If we want to concatenate two arrays, we pass them into concatenate, <br>
  then specify the axis keyword argument that we want to concatenate along. 
  <br>
  
- Concatenating along the first axis is similar to vstack,<br>
  and concatenating along the second axis is similar to hstack:

In [140]:
x=np.concatenate((red_wines,white_wines),axis=0)

print("After Concatenation of files:-\n\n",x)

print("\nShape after concatenation:-", x.shape)

After Concatenation of files:-

 [[ 7.4   0.7   0.   ...  0.56  9.4   5.  ]
 [ 7.8   0.88  0.   ...  0.68  9.8   5.  ]
 [ 7.8   0.76  0.04 ...  0.65  9.8   5.  ]
 ...
 [ 6.5   0.24  0.19 ...  0.46  9.4   6.  ]
 [ 5.5   0.29  0.3  ...  0.38 12.8   7.  ]
 [ 6.    0.21  0.38 ...  0.32 11.8   6.  ]]

Shape after concatenation:- (6497, 12)


In [144]:
# Raises a ValueError because the two arrays do not have the same number of rows.
x=np.concatenate((red_wines,white_wines),axis=1)

print("After Concatenation of files:-\n\n",x)

print(x.shape)

ValueError: all the input array dimensions except for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1599 and the array at index 1 has size 4898
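
To see both axes working, the sketch below concatenates two made-up 2 x 2 arrays (`a` and `b` are hypothetical names) and checks the results against `vstack` and `hstack`.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

rows = np.concatenate((a, b), axis=0)   # stack b below a
cols = np.concatenate((a, b), axis=1)   # place b to the right of a

print(rows.shape, cols.shape)                     # (4, 2) (2, 4)
print(np.array_equal(rows, np.vstack((a, b))))    # True
print(np.array_equal(cols, np.hstack((a, b))))    # True
```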