In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

## Reading data files 

Start with the [Kaggle Pandas microcourse](https://www.kaggle.com/residentmario/creating-reading-and-writing) if you aren't familiar with any of the following concepts: 

- pandas DataFrames and Series 
- CSV files 
- the `read_csv()` function. 

If you are, then great! We'll proceed with loading the dataset. 

We use two `read_csv()` parameters to clean the dataset as it loads: 
- the `index_col` argument sets the `Review #` column as the index, since we won't analyze it as data.
- the `na_values` argument treats the strings `Unrated` and `\n` as NA values so that the `Stars` column is parsed as numeric. 

In [None]:
ramen = pd.read_csv(
    "../input/ramen-ratings/ramen-ratings.csv", 
    index_col="Review #", 
    na_values=["Unrated", "\n"]
)
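To see what these two parameters change, here is a minimal sketch that reads a small in-memory CSV instead of the real file. The three rows and the `Brand A`/`Brand B`/`Brand C` names are made up for illustration; only the column names follow the ramen dataset.

```python
import io

import pandas as pd

# Hypothetical three-row CSV that mimics the layout of the ramen file.
csv_text = (
    "Review #,Brand,Stars\n"
    "3,Brand A,3.75\n"
    "2,Brand B,Unrated\n"
    "1,Brand C,5\n"
)

# Without na_values, the text "Unrated" keeps Stars from being parsed as numbers.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, "Unrated" becomes NaN and Stars is parsed as a float column.
cleaned = pd.read_csv(
    io.StringIO(csv_text),
    index_col="Review #",
    na_values=["Unrated"],
)

print(raw["Stars"].dtype)
print(cleaned["Stars"].dtype)
print(cleaned)
```

Comparing the two printed dtypes shows why the cleanup matters: numeric methods such as `.mean()` only make sense on the second version.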

## Inspecting a dataset 

Describe the DataFrame using the `.describe()` method: 

In [None]:
ramen.describe(include = "all").T

The `.T` attribute transposes the data frame so it's easier to read. Try using the `.describe()` method while omitting `.T`. 

Check the data types of each column: 

In [None]:
ramen.dtypes

Use the `.info()` method to quickly inspect missing values:

In [None]:
ramen.info()

Note that 5 of the columns in this dataset are non-numeric; the only numeric column is `Stars`. 
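If you prefer to get these facts programmatically rather than reading them off the `.info()` output, a sketch along the following lines may help. It uses a made-up three-row table (the `toy` name and its values are assumptions, not part of the ramen data) and two standard pandas calls, `.isnull().sum()` and `.select_dtypes()`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the ramen table; the values are made up.
toy = pd.DataFrame(
    {
        "Brand": ["Brand A", "Brand B", "Brand C"],
        "Style": ["Cup", "Pack", None],
        "Stars": [3.75, np.nan, 5.0],
    }
)

# Missing values per column; .info() reports the complement (non-null counts).
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Names of the columns that pandas treats as numeric.
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```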

## Indexing 

There are many ways to index a DataFrame. Here, we'll use the `.loc[]` indexer.

To use `.loc[]`, first bear in mind that the dimensions for a DataFrame are similar to those of a matrix: rows first, then columns. 

`.loc[]` allows you to index a DataFrame with either names (labels) or boolean (logical) arrays.

For example, to index a column, you would use: 


```python 
ramen.loc[:, "Stars"]
```

Note that using `:` in the first dimension returns all rows. 

To index certain rows, pass a boolean array with one `True`/`False` value per row. 

The `.isnull()` method returns a boolean Series that tells you whether each value is null. 

For example, run the following code: 

In [None]:
ramen.loc[:, "Top Ten"].isnull()

Using the `.sum()` method, which counts each `True` as 1, we can tell that most of the 2,580 ramen in this dataset don't have values in the `Top Ten` column: 

In [None]:
ramen.loc[:, "Top Ten"].isnull().sum()

Filter the DataFrame for ramen whose `Top Ten` value is not NA. We can do this by combining the previous code with the negation operator `~`:

In [None]:
ramen.loc[~ramen["Top Ten"].isnull(), :]
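The mask, count, and filter steps above can be tried end to end on a tiny table. This is only a sketch: the `toy` frame, its four rows, and the `Top Ten` strings are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row table; only the column names follow the ramen dataset.
toy = pd.DataFrame(
    {
        "Brand": ["Brand A", "Brand B", "Brand C", "Brand D"],
        "Top Ten": [np.nan, "2016 #1", np.nan, "2015 #4"],
    },
    index=pd.Index([4, 3, 2, 1], name="Review #"),
)

# Boolean Series: True where Top Ten is missing.
is_missing = toy.loc[:, "Top Ten"].isnull()

# Summing booleans counts the True values.
n_missing = is_missing.sum()

# ~ flips the mask, so only rows with a Top Ten value remain.
ranked = toy.loc[~is_missing, :]

print(n_missing)
print(ranked)
```

Storing the mask in a variable such as `is_missing` is a readability choice; passing the expression directly to `.loc[]` works the same way.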

Let's inspect the `Country` column: 

In [None]:
ramen.Country.unique()

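Note that you can also access a column as an attribute to return a Series, e.g. `ramen.Country`. This only works when the column name is a valid Python identifier, so a column such as `Top Ten` still needs bracket or `.loc[]` indexing. 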

Sarawak is a state in Malaysia, not a country. Replace `Sarawak` with `Malaysia` in the `Country` column:

In [None]:
ramen.loc[ramen.Country == "Sarawak", "Country"] = "Malaysia"
ramen.Country.unique()
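Here is the same conditional assignment on a small made-up Series of countries, so you can see exactly which rows change. The `toy` frame is hypothetical. `Series.replace()` is shown as one possible alternative that returns a new Series instead of modifying the DataFrame in place.

```python
import pandas as pd

# Hypothetical column of country labels.
toy = pd.DataFrame({"Country": ["Japan", "Sarawak", "Malaysia", "Sarawak"]})

# Alternative: replace() leaves `toy` untouched and returns a new Series.
replaced = toy["Country"].replace("Sarawak", "Malaysia")

# .loc[] with a row mask and a column label assigns in place.
toy.loc[toy["Country"] == "Sarawak", "Country"] = "Malaysia"

print(toy["Country"].unique())
```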

## Visualizing categorical data 

The `value_counts()` method returns unique values and counts of categorical data. For example, to count values in the `Country` column, use the `value_counts()` method on the Series you're interested in: 


```python
ramen.loc[:, "Country"].value_counts()
```

and to plot a bar chart, chain `.plot.bar()`: 

```python
ramen.loc[:, "Country"].value_counts().plot.bar()
```

Try this for yourself: 

- Plot a bar chart to summarize the Style column 
- Plot a bar chart of the top 20 brands in the Brand column

In [None]:
ramen.loc[:, "Style"].value_counts().plot.bar()

In [None]:
ramen.loc[:, "Brand"].value_counts().head(20).plot.bar()
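It can help to look at the counts themselves before plotting them. The sketch below uses a made-up `styles` Series (an assumption, not the real Style column) and leaves out the plotting call so that it runs without matplotlib.

```python
import pandas as pd

# Hypothetical Style values.
styles = pd.Series(["Pack", "Cup", "Pack", "Bowl", "Pack", "Cup"], name="Style")

# value_counts() sorts from most to least frequent by default.
counts = styles.value_counts()
print(counts)

# head(n) keeps the n most frequent categories, as in the top-20 brands exercise.
top_two = counts.head(2)
print(top_two)
```

Because the result is already sorted in descending order, chaining `.head(20)` before `.plot.bar()` is enough to chart the 20 most common categories.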

## Grouped statistics

To obtain the average rating throughout your dataset, you could use the `.mean()` method on the Stars column:

In [None]:
ramen.Stars.mean()

But what if you wanted to get the average rating by Brand? 

To do this, we first need a **grouped DataFrame** (a `DataFrameGroupBy` object). The underlying data is unchanged; the object records how the rows are split into groups so that statistics are computed per group. 

In [None]:
type(ramen.groupby("Brand"))

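Then, use the `.mean()` method: 

In [None]:
ramen.groupby("Brand").mean(numeric_only=True)

Note that `numeric_only=True` restricts the result to the numeric columns. Older pandas versions dropped non-numeric columns automatically, but pandas 2.0 and later raise a `TypeError` without it. 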

You can use the `sort_values()` method to sort the grouped means by a column: 

In [None]:
ramen.groupby("Brand").mean(numeric_only=True).sort_values("Stars")

Try this out on `Country` and `Variety`. 

In [None]:
ramen.groupby("Country").mean(numeric_only=True).sort_values("Stars")

In [None]:
ramen.groupby("Variety").mean(numeric_only=True).sort_values("Stars", ascending=False)
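To check by hand what a grouped mean computes, here is a sketch on five made-up rows. The `toy` frame and its ratings are assumptions. The `numeric_only=True` argument is included because recent pandas versions raise an error when `.mean()` meets text columns such as `Style`.

```python
import pandas as pd

# Hypothetical ratings; Brand A averages (3.0 + 5.0) / 2 = 4.0.
toy = pd.DataFrame(
    {
        "Brand": ["Brand A", "Brand B", "Brand A", "Brand B", "Brand C"],
        "Style": ["Cup", "Pack", "Pack", "Cup", "Bowl"],
        "Stars": [3.0, 4.0, 5.0, 2.0, 4.5],
    }
)

# Mean of the numeric columns per brand, highest rated first.
brand_means = (
    toy.groupby("Brand")
    .mean(numeric_only=True)
    .sort_values("Stars", ascending=False)
)
print(brand_means)

# Selecting the column first gives a Series and avoids the text columns entirely.
stars_by_brand = toy.groupby("Brand")["Stars"].mean()
print(stars_by_brand)
```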

## Cross-tabulations 

Cross-tabulation tables show how often each combination of two categorical variables occurs, which makes them useful for viewing relationships between those variables. There's no DataFrame method for this; you'll have to use the `crosstab()` function from pandas:

In [None]:
pd.crosstab(ramen.Country, ramen.Style)
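A small worked example makes the layout of the result easier to read: one row per value of the first variable, one column per value of the second, and counts in the cells. The five rows below are made up, and `margins=True` is an optional `crosstab()` argument that this notebook does not otherwise use.

```python
import pandas as pd

# Hypothetical Country/Style pairs.
toy = pd.DataFrame(
    {
        "Country": ["Japan", "Japan", "Malaysia", "Japan", "Malaysia"],
        "Style": ["Cup", "Pack", "Pack", "Pack", "Pack"],
    }
)

# Rows are countries, columns are styles, cells are counts.
table = pd.crosstab(toy["Country"], toy["Style"])
print(table)

# margins=True appends row and column totals labelled "All".
with_totals = pd.crosstab(toy["Country"], toy["Style"], margins=True)
print(with_totals)
```

Combinations that never occur, such as Malaysia and Cup here, appear as 0 rather than being dropped.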