### What is the difference between a NumPy array and a list?

1. What is the difference between a NumPy array and a list?

    We'll cover some common questions about scientific computing in Python. We'll start with NumPy arrays and compare them to Python lists.

2. NumPy array

    A NumPy array is a data structure from NumPy, the fundamental package for scientific computing with Python. The easiest way to create an array is to pass a list of values to the array() constructor. At first glance, it doesn't differ much from Python's native lists.

3. Similarities between an array and a list

    Both data structures are Iterables. In both cases we can use indexing to access elements. 

    Moreover, NumPy arrays and Python lists can be modified similarly. What's so special about NumPy arrays then?

    Compared to lists, NumPy arrays are optimized for high efficiency computations. How? First of all, NumPy arrays only store data of the same type.

4. dtype property
    
    When we have an array, we can retrieve the data type it stores by accessing its .dtype property. In this case, our array stores integers represented by 64 bits.

8. Changing the data type of an element
    
    Unlike with lists, if we try to assign a value of an incompatible data type to an element (for example, a string that can't be converted to an integer), we'll get a ValueError.
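
A minimal sketch of this behaviour, using made-up values: the list accepts a string in place of a number, while the same assignment on an integer array raises ValueError because the string can't be converted to the array's dtype.

```python
import numpy as np

num_list = [1, 2, 3, 4, 5]
num_array = np.array(num_list)

# A list accepts an element of any type
num_list[0] = "a"

# An integer array tries to convert the value to its dtype and fails
try:
    num_array[0] = "a"
    error_name = None
except ValueError as err:
    error_name = type(err).__name__

print(error_name)            # name of the raised exception
print(num_array.dtype.kind)  # 'i' means the array still stores integers
```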

9. Specifying the data type explicitly
    
    Actually, we can explicitly specify the data type when we create an array using the dtype keyword argument.

    Independently from the list we pass, we can specify other data types like a string, for example. The output we see here means a one-character string.

        num_array = np.array([1,2,3,4,5], dtype=np.dtype('str'))
        num_array.dtype
        > dtype('<U1')

11. Object as a data type

    If we want an array to behave like a list with respect to modification, we can use the dtype equal to 'O' which stands for Object. In this case, we can mix data types. However, we also limit the set of operations we can apply to such an array.

        num_array = np.array([1,2,3,4,5], dtype = np.dtype('O'))
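
As a rough illustration (the values are made up), an object array accepts mixed types, but a numeric routine such as np.sqrt() no longer works on it:

```python
import numpy as np

# With dtype=object, each element is an arbitrary Python object
obj_array = np.array([1, 2, 3, 4, 5], dtype=object)
obj_array[0] = "one"  # mixing types is now allowed

element_types = sorted({type(x).__name__ for x in obj_array})
print(element_types)

# Numeric functions generally can't handle such an array
try:
    np.sqrt(obj_array)
    sqrt_failed = False
except (TypeError, AttributeError):
    sqrt_failed = True

print(sqrt_failed)
```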
    
12. Difference between an array and a list - Accessing items
    
    The second property of NumPy arrays is that they offer a special way to access their elements.
    
    Let's assume we have this two-dimensional list. As an array it can be defined like this. To retrieve a single item, say the 8 in this case, both lists and arrays provide similar options.

        array2d = np.array([
            [1,2,3,4,5],
            [6,7,8,9,10],
            [11,12,13,14,15]
        ])

14. Accessing items

    With arrays though, it's not necessary to specify additional square brackets.

        # Retrieve 8
        array2d[1,2]
        > 8

    But how do we retrieve an entire data block?

    The solution for a list can be tricky.

    An array provides a more elegant and efficient way via slicing.

        array2d[0:2,1:4]
        > array([[2,3,4],
                 [7,8,9]])
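
To make the comparison concrete, here is a small sketch that extracts the same block from a nested list and from an array. The name list2d is introduced here for illustration only.

```python
import numpy as np

list2d = [
    [1, 2, 3, 4, 5],
    [6, 7, 8, 9, 10],
    [11, 12, 13, 14, 15],
]
array2d = np.array(list2d)

# Single item: a list needs chained brackets, an array accepts one pair
item_from_list = list2d[1][2]
item_from_array = array2d[1, 2]

# Block of rows 0-1 and columns 1-3
# List: slice the rows first, then slice every row separately
block_from_list = [row[1:4] for row in list2d[0:2]]
# Array: one slicing expression covers both dimensions
block_from_array = array2d[0:2, 1:4]

print(block_from_list)
print(block_from_array)
```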

18. Difference between an array and a list

    Third, operations work differently on arrays. For simplicity, we'll focus only on numeric arrays.

19. Operations +, -, *, / with lists

    Let's recall that, given two lists, most of the simple mathematical operations will result in TypeError. Addition is an exception; it concatenates given lists.

    In case of NumPy arrays, operations are performed element-wise. As a result, we get a new array.

        num_array1 = np.array([1,2,3])
        num_array2 = np.array([10,20,30])
        num_array1 + num_array2
        > array([11,22,33])
        num_array1*num_array2
        > array([10,40,90])

    The same applies to multidimensional arrays.
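
A short sketch of the contrast, with made-up values: lists concatenate on + and fail on *, while two-dimensional arrays are combined element by element.

```python
import numpy as np

list1, list2 = [1, 2, 3], [10, 20, 30]

# Addition concatenates lists
concatenated = list1 + list2

# Multiplying two lists is not defined
try:
    list1 * list2
    list_product_error = None
except TypeError as err:
    list_product_error = type(err).__name__

# Arrays of the same shape are combined element-wise
a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])
elementwise_sum = a + b
elementwise_ratio = b / a

print(concatenated, list_product_error)
print(elementwise_sum)
print(elementwise_ratio)
```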

22. Conditional operations

    Conditional operations are especially useful. Applying them on an array returns a new array of booleans indicating whether the condition is satisfied or not. The cool part is that we can use these conditions to filter our arrays. This operation takes much more effort with lists.

        num_array = np.array([-5,-4,-3,0,3,4,5])
        num_array[num_array <0]
        > array([-5,-4,-3])
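
The intermediate boolean array can be inspected on its own. The sketch below shows it next to the list-based equivalent, which needs an explicit loop or comprehension.

```python
import numpy as np

num_array = np.array([-5, -4, -3, 0, 3, 4, 5])

# The comparison yields one boolean per element
mask = num_array < 0

# Indexing with the mask keeps the elements where the mask is True
negatives = num_array[mask]

# The same filter on a list requires iterating explicitly
num_list = [-5, -4, -3, 0, 3, 4, 5]
negatives_from_list = [x for x in num_list if x < 0]

print(mask)
print(negatives)
```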

23. Broadcasting

    Another important feature of arrays is broadcasting. It describes how operations work on arrays of different dimensions. For example, what happens if we multiply this array by 3? We certainly know the answer for lists: they get extended. In case of arrays, each element is multiplied by 3 resulting in a new array. The same applies to other operations. We say that 3 broadcasts itself to all the array elements meaning that 3 operates on each element separately.
        
        num_list = [1,2,3]
        num_list * 3
        > [1,2,3,1,2,3,1,2,3]
        
        num_array = np.array([1,2,3])
        num_array * 3
        > array([3,6,9])


24. Broadcasting with multidimensional arrays

    We can do broadcasting with multidimensional arrays. For this example, the one-dimensional array broadcasts itself to all three rows of the two-dimensional array. Broadcasting is applied to rows by default. If we want to broadcast to columns, we need to modify our one-dimensional array to be a column vector. And here's the result.

        array1d = np.array([[1], [2], [3]])
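
A sketch of both cases, assuming a 3x3 array with made-up values: a one-dimensional array of length 3 is applied to every row, while the column vector of shape (3, 1) is applied to every column.

```python
import numpy as np

array2d = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])

# Shape (3,): stretched across the rows
row_vector = np.array([1, 2, 3])
by_rows = array2d * row_vector

# Shape (3, 1): stretched across the columns
column_vector = np.array([[1], [2], [3]])
by_columns = array2d * column_vector

print(by_rows)
print(by_columns)
```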
        
26. Let's practice

In [4]:
import numpy as np

In [5]:
# What is the type of the following array?
np.array([1,(2,3),4]).dtype

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.

#### Accessing subarrays

Let's access elements in NumPy arrays! Your task is to convert a square two-dimensional array square of size size to a list created by following a spiral pattern:

[Figure: traversing the matrix in a spiral pattern]

Rather than simply accessing certain slices, you will define a more general solution using a for loop (the solution should work for all the square two-dimensional arrays of odd size).

The module numpy is already imported as np.

You will need the reversed() function, which reverses an Iterable.

In [7]:
square = np.array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])

In [10]:
spiral = []
size = len(square)

for i in range(0, size):
    # Convert each part marked by a red arrow to a list
    spiral += list(square[i, i:size-i])
    # Convert each part marked by a green arrow to a list
    spiral += list(square[i+1:size-i, size-i-1])
    # Convert each part marked by a blue arrow to a list
    spiral += list(reversed(square[size-i-1, i:size-i-1]))
    # Convert each part marked by a magenta arrow to a list
    spiral += list(reversed(square[i+1:size-i-1, i]))
        
print(spiral)

[1, 2, 3, 4, 5, 10, 15, 20, 25, 24, 23, 22, 21, 16, 11, 6, 7, 8, 9, 14, 19, 18, 17, 12, 13]


#### Operations with NumPy arrays

The following blocks of code create new lists from the input lists input_list1, input_list2, and input_list3 (you can check their values in the console). Suppose you had analogous NumPy arrays input_array1, input_array2, and input_array3 with the same values. How would you create equivalent output as NumPy arrays, using what you know about broadcasting, accessing elements in NumPy arrays, and element-wise operations?

Block 1

list(map(lambda x: [5*i for i in x], input_list1))

Block 2

list(filter(lambda x: x % 2 == 0, input_list2))

Block 3

[[i*i for i in j] for j in input_list3]

In [13]:
input_array1 = np.array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
input_array2 = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
input_array3 = np.array([[1, 2],
       [3, 4],
       [5, 6]])

input_list1 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
input_list2 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
input_list3 = [[1, 2], [3, 4], [5, 6]]

In [14]:
# Substitute the code in the block 1 given the input_array1
output_array1 = input_array1 * 5 
print(list(map(lambda x: [5*i for i in x], input_list1)))
print(output_array1)

[[5, 10, 15], [20, 25, 30], [35, 40, 45]]
[[ 5 10 15]
 [20 25 30]
 [35 40 45]]


In [15]:
# Substitute the code in the block 2 given the input_array2
output_array2 = input_array2[ input_array2 % 2 == 0]
print(list(filter(lambda x: x % 2 == 0, input_list2)))
print(output_array2)

[0, 2, 4, 6, 8]
[0 2 4 6 8]


In [16]:
# Substitute the code in the block 3 given the input_array3
output_array3 = input_array3 * input_array3
print([[i*i for i in j] for j in input_list3])
print(output_array3)

[[1, 4], [9, 16], [25, 36]]
[[ 1  4]
 [ 9 16]
 [25 36]]


### How to use the .apply() method on a DataFrame?

1. How to use the .apply() method on a DataFrame?

    Let's move to DataFrames! We'll cover one of the most frequently used methods, .apply().

2. Dataset

    First, let's pick a dataset. We'll work with data on 100 students and their performance on different subjects. Each performance score varies between 0 and 100.

3. Default .apply()

    Let's use the .apply() method. It requires one argument - a function that, by default, is applied on each column of a DataFrame. However, the output of .apply() may differ. For example, applying the sqrt() function results in a DataFrame with square roots of original values.
        
        scores_new = scores.apply(np.sqrt)

4. Default .apply()

    However, using the mean() function returns a Series. Why?

        scores_new = scores.apply(np.mean)
        type(scores_new)
            > pandas.core.series.Series

    The columns we apply the function to are passed as pandas Series. When we use sqrt(), we simply modify each value in a column and return an object of the same size. When we use mean(), we summarize the Series with a single value.

    For example, let's define a function halving our scores. We get a modified DataFrame because passing columns to our defined function results in an object of the same size.

    On the contrary, if we return only one value - for example, a perfect score - we summarize each column by a single value. Therefore, we get pandas Series.
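
The two cases can be sketched as follows. The small scores DataFrame and the function names half() and perfect() are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scores dataset
scores = pd.DataFrame({
    "math score": [64, 81, 100],
    "reading score": [49, 36, 25],
})

def half(series):
    # Returns an object of the same size as the input column
    return series / 2

def perfect(series):
    # Summarizes the whole column with a single value
    return 100

roots = scores.apply(np.sqrt)       # DataFrame
halved = scores.apply(half)         # DataFrame
summarized = scores.apply(perfect)  # Series, one value per column

print(type(halved).__name__, type(summarized).__name__)
```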

8. Lambda expressions

    Of course, our functions can be substituted with lambda expressions!

        scores_new = scores.apply(lambda x: x/2)

    It will simplify our code with no changes in our output.

10. Additional arguments: axis

    Let's have a look at additional arguments we can pass to the .apply() method. We'll start with the axis argument, which can be either 0 (the default) or 1.

        df.apply(function, axis= )
        axis=0, over columns (default)
        axis=1, over rows
        
    0 means that the function is applied over the columns of a DataFrame, 1 - over the rows. Specifying this argument is useful for functions resulting in a single value like mean().

    Zero implies no difference from the default behavior: we get the mean of each column. 1 implies averaging values in each row instead.
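
A minimal sketch with a hypothetical two-column scores DataFrame shows what each setting of axis returns:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scores dataset
scores = pd.DataFrame({
    "math score": [60, 80, 100],
    "reading score": [50, 70, 90],
})

# axis=0 (default): the function receives each column
column_means = scores.apply(np.mean, axis=0)

# axis=1: the function receives each row
row_means = scores.apply(np.mean, axis=1)

print(column_means)
print(row_means)
```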

16. Additional arguments: result_type

    The next argument we'll discuss is result_type. We'll consider only some of the values it can take. The first one is expand. To understand it, let's define a function that returns a list with the minimum and the maximum value of the input. When we apply the function to the DataFrame, we get a pandas Series with the corresponding summary for each column. Notice that the list returned by the span() function is considered as a single value summarizing our input, despite the fact that its size is 2. Therefore, the .apply() method results in a pandas Series.

        df.apply(function, result_type= )
        result_type='expand'

    Specifying the keyword argument unwraps our list resulting in the following DataFrame.

    Adding the axis argument and setting it to 1 applies the span() function row-wise and unfolds the list for each row.

    The second useful value for result_type is broadcast. To understand it, let's consider applying the mean() function again.

        result_type='broadcast'

    Specifying broadcasting results in a DataFrame of the original size where each column is filled with the corresponding output from the mean() function.
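
The sketch below tries both values on a hypothetical scores DataFrame. It applies the functions row-wise (axis=1), since the pandas documentation describes result_type as acting when axis=1. The body of span() is an assumption based on the description above.

```python
import numpy as np
import pandas as pd

def span(series):
    # Minimum and maximum of the input, returned as a list
    return [series.min(), series.max()]

# Hypothetical stand-in for the scores dataset
scores = pd.DataFrame({
    "math score": [60.0, 80.0, 100.0],
    "reading score": [50.0, 70.0, 90.0],
})

# Default: each returned list is kept as a single value in a Series
as_lists = scores.apply(span, axis=1)

# 'expand': each list is unfolded into separate columns
expanded = scores.apply(span, axis=1, result_type="expand")

# 'broadcast': the row mean is repeated to fill the original shape
broadcasted = scores.apply(np.mean, axis=1, result_type="broadcast")

print(as_lists)
print(expanded)
print(broadcasted)
```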

21. More than one argument in a function

    So far, the functions we used with .apply() had only one argument. But what if we have more arguments, including keyword arguments? For example, let's have a function that by default checks if the calculated mean is within a certain interval. If the value of the keyword argument changes to False, then we check the opposite scenario.

        df.apply(function, args= )
        args = [arg1,arg2,..]

        
23. Applying the function

    Let's use .apply() with our function. We get TypeError because we didn't specify its arguments!

24. Additional arguments: args

    They can be specified in the args argument of the .apply() method. It's a tuple (a list also works) containing positional arguments for our function. Let's try it now. It works! Notice that the values should have the same order as the function arguments. We didn't specify the 'inside' keyword argument, so the function executes with its default value. What if we want to pass another value?

    We can simply pass it to .apply() as a keyword argument after args. As expected, setting it to False produces an inverted result.
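
The three steps above can be sketched as follows. The function name check_mean(), its parameters low and high, and the small scores DataFrame are assumptions for illustration. Only the inside keyword comes from the text.

```python
import pandas as pd

def check_mean(series, low, high, inside=True):
    # Is the column mean inside [low, high]? With inside=False, test the opposite
    mean = series.mean()
    if inside:
        return bool(low <= mean <= high)
    return bool(mean < low or mean > high)

# Hypothetical stand-in for the scores dataset (column means: 80 and 70)
scores = pd.DataFrame({
    "math score": [60, 80, 100],
    "reading score": [50, 70, 90],
})

# Without args, the required positional arguments are missing
try:
    scores.apply(check_mean)
    missing_args_error = None
except TypeError as err:
    missing_args_error = type(err).__name__

# Positional arguments go into args, in the order the function expects
inside_result = scores.apply(check_mean, args=(60, 75))

# Keyword arguments of the function are passed straight to .apply()
outside_result = scores.apply(check_mean, args=(60, 75), inside=False)

print(missing_args_error)
print(inside_result)
print(outside_result)
```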

26. Let's practice!

#### Simple use of .apply()

Let's get some hands-on experience with .apply()!

You are given the full scores dataset containing students' performance as well as their background information.

Your task is to define the prevalence() function and apply it to the groups_to_consider columns of the scores DataFrame. This function should retrieve the most prevalent group/category for a given column (e.g. if the most prevalent category in the lunch column is standard, then prevalence() should return standard).

The reduce() function from the functools module is already imported.

Tip: pd.Series is an Iterable object. Therefore, you can use standard operations on it.

def prevalence(series):
    vals = list(series)
    # Create a tuple list with unique items and their counts
    itms = [(x, vals.count(x)) for x in set(vals)]
    # Extract a tuple with the highest counts using reduce()
    res = reduce(lambda x, y: x if x[1]>=y[1] else y, itms)
    # Return the item with the highest counts
    return res[0]

# Apply the prevalence function on the scores DataFrame
result = scores[groups_to_consider].apply(prevalence)
print(result)

#### Additional arguments

Let's use additional arguments in the .apply() method!

Your task is to create two new columns in scores:

- mean is the row-wise mean value of the math score, reading score and writing score
- rank defines how high the mean score is:
    - 'high' if the mean value > 90
    - 'medium' if the mean value > 60 but <= 90
    - 'low' if the mean value <= 60

To accomplish this task, you'll need to define the function rank that, given a series, returns a list with two values: the mean of the series and a string defined by the aforementioned rule.

In [None]:
import numpy as np 

def rank(series):
    # Calculate the mean of the input series
    mean = series.mean()
    # Return the mean and its rank as a list
    if mean > 90:
        return [mean,'high']
    elif mean > 60:
        return [mean,'medium']
    else:
        return [mean,'low']

# Insert the output of rank() into new columns of scores
cols = ['math score', 'reading score', 'writing score']
scores[['mean', 'rank']] = scores[cols].apply(rank, result_type = 'expand', axis=1)
print(scores[['mean', 'rank']].head())

#### Functions with additional arguments

Let's add some arguments to the function definition!

Numeric data in scores represent students' performance scaled between 0 and 100. Your task is to rescale this data to an arbitrary range between low and high. Rescaling should be done in a linear fashion, i.e. for any data point x in a column:

    x_new = x * (high - low) / 100 + low

To do rescaling, you'll have to define the function rescale(). Remember, the operation written above can be applied to Series directly. After defining the function, you'll have to apply it to the specified columns of scores.

In [None]:
def rescale(series, low, high):
   # Define the expression to rescale input series
   return series * ((high - low)/ 100) + low

# Rescale the data in cols to lie between 1 and 10
cols = ['math score', 'reading score', 'writing score'] 
scores[cols] = scores[cols].apply(rescale, args=[1,10] )
print(scores[cols].head())

In [None]:
# Redefine the function to accept keyword arguments
def rescale(series, low=0, high=100):
   return series * (high - low)/100 + low

# Rescale the data in cols to lie between 1 and 10
cols = ['math score', 'reading score', 'writing score']
# Keyword arguments of rescale() are passed directly to .apply()
scores[cols] = scores[cols].apply(rescale, low=1, high=10)
print(scores[cols].head())
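
One way to sanity-check the rescaling is to run it on a tiny hypothetical DataFrame that contains the endpoints 0 and 100, passing low and high once positionally and once by keyword:

```python
import pandas as pd

def rescale(series, low=0, high=100):
    # Linear map from [0, 100] to [low, high]
    return series * (high - low) / 100 + low

# Hypothetical data containing the endpoints of the original scale
scores = pd.DataFrame({
    "math score": [0, 50, 100],
    "reading score": [20, 40, 60],
})

# Positional arguments via args
by_position = scores.apply(rescale, args=(1, 10))

# The same arguments passed by keyword
by_keyword = scores.apply(rescale, low=1, high=10)

print(by_position)
```

Both calls should give the same result, with 0 mapped to low and 100 mapped to high.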

### How to use the .groupby() method on a DataFrame?

1. How to use the .groupby() method on a DataFrame?

2. Dataset

    We'll refer to the dataset on the relationship between personal background factors and the concentrations of plasma B-carotene and plasma retinol in blood. Low concentrations of these compounds have been suggested to be associated with a higher risk of cancer.

3. .groupby()

    The .groupby() method groups the data according to some criteria. We can then perform an operation on each group. The most common way is to group the data by a factor specified by a column name. For example, here we split by one factor, gender, and here - by two, gender and smoking. In both cases the output is a special DataFrameGroupBy object.

        df.groupby('gender')
        df.groupby(['gender', 'smoking'])

4. Iterating through .groupby() output

    It's possible to iterate through this object. Each item is a tuple with the first element being a grouping factor and the second - the corresponding DataFrame.

    More grouping factors imply more DataFrames. Here, we get as many DataFrames as there are gender / smoking combinations.
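
A small sketch of the iteration, using a hypothetical four-row stand-in for the retinol dataset:

```python
import pandas as pd

# Hypothetical stand-in for the retinol dataset
retinol = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male"],
    "smoking": ["Never", "Former", "Never", "Never"],
    "plasma retinol": [500.0, 600.0, 700.0, 900.0],
})

# One factor: each item is (group label, DataFrame with that group's rows)
group_sizes = {}
for key, group in retinol.groupby("gender"):
    group_sizes[key] = len(group)

# Two factors: the label is a tuple, one entry per factor
pair_keys = [key for key, _ in retinol.groupby(["gender", "smoking"])]

print(group_sizes)
print(pair_keys)
```

Only combinations that actually occur in the data produce a group, so this example yields three DataFrames rather than four.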

6. Standard operations on groups
    
    There are many cool things we can do with groups! For example, we already know that DataFrames and Series provide many standard methods to use. We can select a column and apply a method of interest. For example, .mean() or .count(). We can use the same functionality for groups! Here is the mean for each group. And here is the count of valid values for each group. 

        gens = retinol.groupby('gender')

        gens['plasma retinol'].mean()
        > gender    plasma retinol
        > Female    587.721612
        > Male      700.738095

        gens['vitamin use'].count()
        > gender    vitamin use
        > Female    273
        > Male      42
        
7. The .agg() method
    
    Actually, almost all the DataFrame or Series methods can be applied to DataFrameGroupBy objects. For example, let's recall the .agg() method. It's almost identical to the .apply() method we talked about before. By default, it applies a function to each specified column that summarizes it with a single value. For example, to calculate the mean value of the plasma retinol level, we can pass the NumPy mean() function to the method.

        retinol['plasma retinol'].agg(np.mean)
        > 602.790476

    The big difference to the .apply() method is that we can specify several aggregating functions in a list.

        retinol[['plasma B-carotene', 'plasma retinol']].agg([np.mean, np.std])

    As you might have guessed, the .agg() method can be successfully used for DataFrameGroupBy objects.

        gensmoks = retinol.groupby(['gender', 'smoking'])
        gensmoks['plasma retinol'].agg([np.mean, np.std])
        
11. Own functions and lambda expressions
    
    We can, of course, create our own functions. For example, let's count the number of values in a column exceeding the mean value. Here's the corresponding output.
    We can also insert lambda expressions in the .agg() method. Here, we simply calculate the size of the column in a group.

        gens[['plasma B-carotene', 'plasma retinol']].agg([n_more_than_mean, lambda x: len(x)])
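
As a sketch of how these pieces fit together, the example below runs .agg() on a hypothetical five-row stand-in for the retinol dataset. The body of n_more_than_mean() is an assumption based on the description above, and the standard aggregations are passed by name ('mean', 'std'), which pandas also accepts.

```python
import pandas as pd

# Hypothetical stand-in for the retinol dataset
retinol = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Male", "Male"],
    "plasma retinol": [500.0, 600.0, 700.0, 700.0, 900.0],
})

def n_more_than_mean(series):
    # Number of values in the group that exceed the group's mean
    return (series > series.mean()).sum()

gens = retinol.groupby("gender")

# Several standard aggregations at once
summary = gens["plasma retinol"].agg(["mean", "std"])

# A custom function next to a lambda expression
custom = gens["plasma retinol"].agg([n_more_than_mean, lambda x: len(x)])

print(summary)
print(custom)
```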

13. Renaming the output

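    In older pandas versions, passing a dictionary instead of a list of functions used the keys as output column names. Since pandas 1.0, dictionary keys must refer to existing columns, and the output is renamed with named aggregation instead: each keyword is the new column name and its value is a (column, function) pair.

        gens.agg(count=('plasma retinol', n_more_than_mean), len=('plasma retinol', lambda x: len(x)))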

14. The .transform() method
    
    Another useful DataFrame method is .transform(). It's also almost identical to the .apply() method we already discussed. It modifies the values in each given column by some rule specified in a function. For example, let's have a function that centers and scales the data in a column.

        df.transform(function)

15. DataFrame and the .transform() method

    Here's the modified DataFrame after applying the .transform() method with our function on two columns.

16. .groupby() followed by .transform()

    If we apply the .transform() method on groups, the output will be different because we modify the columns in each group separately. Afterwards, the transformed data is merged into a single DataFrame.

    Of course, instead of a well-defined function, we can also use a lambda expression.
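
The difference between the two calls can be sketched on a hypothetical five-row stand-in for the retinol dataset. The function center_scale() is an assumed implementation of the centering and scaling described above.

```python
import pandas as pd

# Hypothetical stand-in for the retinol dataset
retinol = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Male", "Male"],
    "plasma retinol": [500.0, 600.0, 700.0, 700.0, 900.0],
})

def center_scale(series):
    # Subtract the mean and divide by the standard deviation
    return (series - series.mean()) / series.std()

# Whole column at once: mean and std are computed over all rows
overall = retinol[["plasma retinol"]].transform(center_scale)

# Per group: mean and std are computed within each gender separately
per_group = retinol.groupby("gender")["plasma retinol"].transform(center_scale)

print(overall)
print(per_group)
```

Both results keep the original row order and length. Only the reference mean and standard deviation differ.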

18. The .filter() method of DataFrameGroupBy object

    The last method we discuss is the .filter() method. It filters out groups according to the logical output of the passed function and merges the remaining groups into a new DataFrame. Notice that a function acts on the whole DataFrame in each group. Therefore, we can specify quite complex filters.
    
        df.groupby(...).filter(function)

19. .groupby() followed by .filter()

    For example, when we group here, we get 6 groups. Let's have a function that checks if the mean BMI value is higher than 26. When we use it inside .filter(), we get the filtered DataFrame.

        gensmoks = retinol.groupby(['gender', 'smoking'])        
        len(gensmoks)
        > 6
        
        def check_bmi(dataframe):
            return np.mean(dataframe['bmi']) > 26
            
20. .groupby() followed by .filter()
    
    We can check how many groups were filtered out by grouping the filtered data again. Now we have only 3 groups instead of 6.

        retinol_filtered = gensmoks.filter(check_bmi)
        len(retinol_filtered.groupby(['gender', 'smoking']))
        > 3
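
A self-contained sketch of the same steps on a hypothetical five-row stand-in for the retinol dataset (the real data gives 6 and 3 groups, while this toy data gives 4 and 3):

```python
import pandas as pd

# Hypothetical stand-in for the retinol dataset
retinol = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male", "Female"],
    "smoking": ["Never", "Never", "Never", "Former", "Former"],
    "bmi": [22.0, 24.0, 28.0, 30.0, 27.0],
})

def check_bmi(dataframe):
    # Receives the whole DataFrame of one group and returns a single boolean
    return dataframe["bmi"].mean() > 26

gensmoks = retinol.groupby(["gender", "smoking"])
n_groups_before = len(gensmoks)

# Rows of groups that fail the check are dropped
retinol_filtered = gensmoks.filter(check_bmi)
n_groups_after = len(retinol_filtered.groupby(["gender", "smoking"]))

print(n_groups_before, n_groups_after)
print(retinol_filtered)
```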

#### Standard DataFrame methods

You are given the diabetes dataset storing information on female patients tested for diabetes. You will focus on blood glucose levels and the test results. Subjects who test positive usually have higher blood glucose levels after the so-called glucose tolerance test. Your task is to investigate whether this holds for this specific dataset.

The plasma glucose column corresponds to the glucose levels. The test result column corresponds to the diabetes test results.

You must use standard DataFrame methods (the numpy module is not imported for you).

In [None]:
import pandas as pd

# Load the data from the diabetes.csv file
diabetes = pd.read_csv('diabetes.csv')
# .info() prints its summary itself and returns None
diabetes.info()

# Calculate the mean glucose level in the entire dataset
print(diabetes['plasma glucose'].mean())

# Group the data according to the diabetes test results
diabetes_grouped = diabetes.groupby('test result')

# Calculate the mean glucose levels per group
print(diabetes_grouped['plasma glucose'].mean())

#### BMI of villains

Let's return to the heroes dataset containing the information on different comic book heroes. We added a bmi column to the dataset calculated as Weight divided by (Height/100)**2. This index helps define whether an individual has weight problems.

Your task is to find out what is the mean value and standard deviation of the BMI index depending on the character's 'Alignment' and the 'Publisher' whom this character belongs to. However, you'll need to consider only those groups that have more than 10 valid observations of the BMI index.

Tip: use .count() to calculate the number of valid observations.

In [None]:
import numpy as np

# Group the data by two factors specified in the context
groups = heroes.groupby(['Alignment','Publisher'])

# Filter groups having more than 10 valid bmi observations
fheroes = groups.filter(lambda x: x['bmi'].count() > 10)

# Group the filtered data again by the same factors
fgroups = fheroes.groupby(['Alignment','Publisher'])

# Calculate the mean and standard deviation of the BMI index
result = fgroups['bmi'].agg([np.mean,np.std])
print(result)

#### NaN value imputation

Let's try to impute some values, using the .transform() method. In the previous task you created a DataFrame fheroes where all the groups with insufficient amount of bmi observations were removed. Our bmi column has a lot of missing values (NaNs) though. Given two copies of the fheroes DataFrame (imp_globmean and imp_grpmean), your task is to impute the NaNs in the bmi column with the overall mean value and with the mean value per group defined by Publisher and Alignment factors, respectively.

Tip: pandas Series and DataFrames have a .fillna() method which substitutes all the encountered NaNs with a value specified as an argument.

In [None]:
# Define a lambda function that imputes NaN values in series
impute = lambda series: series.fillna(np.mean(series))

# Impute NaNs in the bmi column of imp_globmean
imp_globmean['bmi'] = imp_globmean['bmi'].transform(impute)
print("Global mean = " + str(fheroes['bmi'].mean()) + "\n")

groups = imp_grpmean.groupby(['Publisher', 'Alignment'])

# Impute NaNs in the bmi column of imp_grpmean
imp_grpmean['bmi'] = groups['bmi'].transform(impute)
print(groups['bmi'].mean())
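
The difference between the two imputation strategies is easier to see on a tiny example. The four-row heroes DataFrame below is hypothetical and groups by Publisher only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the heroes dataset
heroes = pd.DataFrame({
    "Publisher": ["A", "A", "B", "B"],
    "bmi": [20.0, np.nan, 30.0, np.nan],
})

# Replace NaNs in a Series with the mean of its valid values
impute = lambda series: series.fillna(series.mean())

# Overall mean: every NaN gets the same value
global_imputed = impute(heroes["bmi"])

# Group mean: each NaN gets the mean of its own group
group_imputed = heroes.groupby("Publisher")["bmi"].transform(impute)

print(global_imputed.tolist())
print(group_imputed.tolist())
```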

### How to visualize data in Python?

2. matplotlib

    The usual way to proceed is to use the matplotlib module. More precisely, to use its pyplot submodule. Usually, we abbreviate it as plt. We'll consider basic plots such as the scatter plot, histogram, and boxplot.

        import matplotlib.pyplot as plt

4. Scatter plot

    Let's start with the scatter plot. It's a simple representation of data points in two-dimensional space, given that each point has valid coordinates. Scatter plot is very useful for examining how two numeric variables relate to each other.

5. Create a scatter plot
    
    Given a DataFrame, we can create a scatter plot simply by passing the columns of interest to the scatter() function. The order is important: the first argument corresponds to the horizontal axis, the second - to the vertical one. To show the plot, we have to call the show() function. This function should always be at the end to complete our plotting activity. Now we see our scatter plot. Can you notice what's wrong with it? Right, it has neither a title nor labels, which is a very bad practice!

        plt.scatter(x,y)

        plt.show()

6. Create a scatter plot

    To add a title, we can use the title() function.

        plt.title('')

    To add labels for horizontal and vertical axes, we can use the xlabel() and ylabel() functions, respectively. It's OK to skip the title, but NEVER forget to label your axes!

        plt.xlabel('')
        plt.ylabel('')
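
Putting the calls together, a complete scatter plot might look like the sketch below. The data values and label texts are made up, and the Agg backend is selected only so that the example also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical measurements
bmi = [22.5, 27.1, 31.4, 35.0]
glucose = [90, 110, 135, 150]

# First argument: horizontal axis, second argument: vertical axis
plt.scatter(bmi, glucose)
plt.title("Plasma glucose vs. BMI")
plt.xlabel("bmi")
plt.ylabel("plasma glucose")

ax = plt.gca()  # keep a handle to the axes for inspection
plt.show()      # always the last plotting call
```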

8. Histogram

    Let's move on and meet the histogram! It's a special plot showing how our numerical data is distributed. The horizontal space is divided into so-called bins. The height of a bin indicates how many data points are enclosed in the horizontal space spanned by it. Here, for example, we see that the majority of data points is concentrated around 0.

9. Create a histogram

    Let's create a histogram showing the distribution of the BMI indices in our diabetes data. We need to call the hist() function with the chosen column as an argument.

        plt.hist(x, bins=20)

10. Create a histogram
    
    We can also change the number of bins used to create a histogram. We just need to use the corresponding keyword argument.
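
As a rough sketch with made-up BMI values, hist() also returns the count per bin and the bin edges, which makes the effect of the bins argument easy to check. The Agg backend is selected only so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical BMI values
bmi = [19.5, 21.0, 22.3, 23.8, 24.1, 25.6, 26.0, 27.4, 29.9, 31.2, 33.5, 38.0]

# hist() returns the counts, the bin edges, and the drawn bars
counts, bin_edges, _ = plt.hist(bmi, bins=5)
plt.xlabel("bmi")
plt.ylabel("count")
plt.show()

print(counts)
print(bin_edges)
```

With bins=5 there are five counts and six edges, and the counts add up to the number of data points.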

11. Boxplot
    
    Let's move on to boxplots! Like a histogram, a boxplot shows how our numerical data is distributed. Here, 50% of data points are located within the box with the orange line indicating the median value. In turn, the whiskers show the spread of our data. What is outside this range is considered as an outlier. As you can see, boxplots are great when we want to show if there is a difference between groups.
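
The quantities a boxplot draws can be computed by hand. The sketch below uses made-up data and assumes the common convention, which is also the default in matplotlib and seaborn, that whiskers reach at most 1.5 times the interquartile range beyond the box.

```python
import numpy as np

# Hypothetical data with one extreme value
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# The box spans the first to the third quartile; the line inside is the median
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(q1, median, q3)
print(outliers)
```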

12. Create a boxplot

    To create a boxplot, it's much easier to use the seaborn module than matplotlib. Usually, we abbreviate it as sns. To visualize the data, we use the boxplot() function. We specify the data source with the data keyword argument and the column names from this source. In this case, x corresponds to the column with test results, which, as it is a factor, determines the number of boxplots we see. y corresponds to the column with BMI indices, which represents the actual data for each boxplot. Older seaborn versions also accepted the two column names positionally; from version 0.12 on, they must be passed as keywords.

        import seaborn as sns

        sns.boxplot(x='test_result', y='bmi', data=diabetes)
        plt.title('Boxplots of BMI per test result')
        plt.show()

13. Create a boxplot
    
    We can precisely define what is plotted against horizontal x axis and against vertical y axis with the corresponding keyword arguments.

14. Create a boxplot

    Swapping the columns assigned to the x and y keyword arguments rotates the boxplot. Finally, notice that with the seaborn module we don't need to specify our axis labels. It's done automatically!

        sns.boxplot(y='test_result', x='bmi', data=diabetes)
        

#### Plot a histogram

Let's further investigate the retinol dataset. Your task now is to create a histogram of the plasma retinol feature.

In [None]:
plt.hist(retinol['plasma retinol'], bins=20)
plt.title('Histogram of Plasma Retinol')

# Add other missing parts to the plot
plt.xlabel('plasma retinol')
plt.ylabel('count')
plt.show()

#### Creating boxplots

Let's get back to our heroes dataset. As we previously discovered, the BMI index is on average much higher for villains than for good characters (taking into account only Marvel and DC publishers). Your task is to plot the corresponding distributions of BMI indices using boxplots.

Tip: to select rows in a DataFrame, for which a specific column follows a certain condition, use this expression dataframe[condition for column_name] (e.g. heroes[heroes['Alignment'] == 'good'] selects rows that have a 'good' Alignment in the heroes dataset).

In [None]:
import seaborn as sns

# Select rows from 'heroes' for which the BMI index < 1000
heroes_filtered = heroes[heroes['bmi'] < 1000]

# Create a new boxplot of BMI indices
sns.boxplot(x='Alignment', y='bmi', data=heroes_filtered)
plt.show()