Welcome to Day 1 of the 5-Day Data Challenge! Today, we're going to be looking at how to read different file formats into R. ([If you're new to R, you might want to look through these introductory lessons.](https://www.kaggle.com/rtatman/getting-started-in-r-first-steps)) Specifically, we're going to:

* Learn about different ways of storing structured data
* Read in .json files
* Read in .txt files
* Read in .xls & .xlsx files

I'll start by introducing each concept or technique, and then you'll get a chance to apply it with an exercise (look for the **Your turn!** section). Ready? Let's get started!

___

**Kernel FAQs:**

* **How do I get started?** To get started, click the blue "Fork Notebook" button in the upper right-hand corner. This will create a private copy of this notebook that you can edit and play with. Once you're finished with the exercises, you can choose to make your notebook public to share with others. :)

* **How do I run the code in this notebook?** Once you fork the notebook, it will open in the notebook editor. From there you can write code in any code cell (the ones with the grey background) and run the code by either 1) clicking in the code cell and then hitting CTRL + ENTER or 2) clicking in the code cell and then clicking on the white "play" arrow to the left of the cell. If you want to run all the code in your notebook, you can use the double "fast forward" arrows at the bottom of the notebook editor.

* **How do I save my work?** Any changes you make are saved automatically as you work. You can run all the code in your notebook and save a static version by hitting the blue "Commit & Run" button in the upper right hand corner of the editor. 

* **How can I find my notebook again later?** The easiest way is to go to your user profile (https://www.kaggle.com/replace-this-with-your-username), then click on the "Kernels" tab. All of your kernels will be under the "Your Work" tab, and all the kernels you've upvoted will be under the "Favorites" tab.

___

# Setting up our environment
___

First, let's read in all the packages we're going to use in this kernel. Usually, I'd read in the data here as well, but reading data is the whole point of this kernel, so we'll wait on that for now.

>**Important:** make sure you run this cell first. Otherwise, the libraries I'll be using won't be loaded into your local R session and you'll get errors when you try to run the other cells. 

In [None]:
# libraries we'll need
library(tidyverse) # handy utility functions
library(readxl) # for reading in Excel (.xls & .xlsx) files
library(jsonlite) # for reading in json

# Tabular vs. Hierarchical vs. Raw Text Data
____

If you've used R in the past, or Pandas in Python, you've probably mostly worked with .csv (or "comma separated values") files. These are pretty easy to deal with in R: you can read them into a dataframe using the `read.csv()` function (or `read_csv()` if you're using the tidyverse). But other data formats, in particular .json, .txt and .xls(x), can be a little trickier to read into R.

Before we dive into how to handle these different formats, let's quickly talk about some of the different types of data you might encounter. In general, different types of data are stored in different file types because they have different structures. This also means that they are best represented with different data structures once you do read them into R.

* **Tabular data** is basically a spreadsheet format like what you'd see in Excel or Google Sheets. It has rows and columns. Often, each row will be a single observation and each column will be a specific variable. The same variables will be recorded for each observation (or else you'll have empty cells in some rows).
    * File formats: .csv, .tsv, .xls, .xlsx
    * R data structure(s): dataframe, matrix (if all numeric)
* **Hierarchical data** is a data format where values can be nested within each other. With hierarchical data structures, you can have different information about each observation. You generally want to avoid "flattening" hierarchical data into a tabular data structure because it's often not space efficient. For example, if you have a dataset of food products and clothes products, you probably want to know the expiration date for the food and the clothing size for the clothes. In a hierarchical structure you don't need to specify that you don't know the size for the food or the expiration date for the clothes, but in a tabular data structure you would. As a result, you'd end up with a lot of NA cells that aren't very informative and waste space.
    * File formats: .json, .xml
    * R data structure(s): list
* **Raw text data** is just that: raw text that doesn't have a specific data structure specified in the format of the file. This type of data is also called "unstructured". 
    * File format: .txt
    * R data structure(s): character string

![](https://i.imgur.com/6XIfG0o.png)

**OK, so why does this matter?** The reason I started by talking about the differences between these types of data is because I want to make it clear that there's a bigger difference between these file formats than just the letters at the end. Different file formats also reflect differences in how the underlying data is structured. In some cases you can convert between the different types of files (for example, if you have the same fields for each object in a .json file and no nested values you can safely convert it to a .csv) but generally it's more space efficient to use a data structure that makes sense given the way the original data was organized.
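
To make the food and clothing example concrete, here is a minimal sketch using `jsonlite`. The two product records are made up for illustration. It reads the same JSON text once as a nested list and once as a dataframe, so you can see where the NA cells come from.

```r
library(jsonlite)

# Made-up records: the food item has an expiration date, the shirt has a size
products_json <- '[
  {"name": "milk",  "type": "food",     "expires": "2018-03-01"},
  {"name": "shirt", "type": "clothing", "size": "M"}
]'

# Hierarchical: a list of lists, each holding only the fields it actually has
products_list <- fromJSON(products_json, simplifyVector = FALSE)
names(products_list[[2]])   # the shirt has no "expires" entry

# Tabular: one column per field, with NA wherever a field was not recorded
products_df <- fromJSON(products_json)
products_df
```

With only two records the extra NA cells are harmless, but with many different fields spread across many records the table can end up mostly empty.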

In other words **the first step to reading a dataset into R is to understand how it's organized and what type of data structure you should use to store it**. Fortunately, the functions we'll be working with today are pretty smart about how they handle data and can do a lot of the guessing for us.


## Your turn!

Check out each of these three datasets (hint: look in the Data tab so that you can see a quick preview of each dataset) and decide whether the dataset is tabular, hierarchical or raw text data. 

* [Reddit Memes Dataset: A collection of the latest memes from the various meme subreddits](https://www.kaggle.com/sayangoswami/reddit-memes-dataset/data)
* [Emoji sentiment: Are people that use emoji happier?](https://www.kaggle.com/harriken/emoji-sentiment)
* [Aristo MINI Corpus: 1,197,377 science-relevant sentences drawn from public data](https://www.kaggle.com/allenai/aristo-mini-corpus)

# JSON files
___

JSON files generally represent hierarchical data structures. Let's take a look at an example file to see what that looks like when we're working with them in R.

The dataset I'm working with here contains .json files with information on different movies, including their ratings across different platforms. Let's read it in using the `read_json` function from the `jsonlite` package.

In [None]:
movie_ratings <- read_json("../input/rating-vs-gross-collector/2018-2-4.json")

Now that we've read it in, let's look at part of the structure of the data:

In [None]:
# Since when we look at the structure of a list we see the whole list, I'm 
# going to save that output to a variable and print just a few rows of it.
# You can see the whole structure by just running str(movie_ratings).
json_structure <- capture.output(str(movie_ratings))
print(json_structure[1:16])

By looking at the very first line, we can see that this is a list with 1 item in it. The second line tells us that this one item is itself a list, with 38 items (in this case, one for each movie). We can pull out individual entries from a list by using double bracket notation.

> **What are double brackets?** Double brackets (they look like this: `[[ ]]`) let you pull out a single item from an object, either by its name or its index. You'll see them most often used with lists, but you can also use them with dataframes instead of the \$ notation. `dataframe[['column']]` is the same thing as `dataframe$column`.
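
If you'd like to try the bracket notation on something small first, here is a quick sketch. The `movie` list and `ratings` dataframe are made-up stand-ins, not part of the dataset.

```r
# A small made-up list and dataframe to compare the accessors
movie <- list(Title = "Example Movie", Year = 2018)
ratings <- data.frame(score = c(7.1, 8.4), votes = c(120, 300))

movie[["Title"]]   # pull out an item by name
movie[[2]]         # pull out an item by position
movie$Year         # $ is shorthand for [[ ]] with a name

# On a dataframe, [[ ]] and $ both return the column as a vector
identical(ratings[["score"]], ratings$score)

# Single brackets keep the container: this is still a list, of length 1
class(movie["Title"])
```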

Since in this data structure we have a list inside another list, we're going to need to use two sets of double brackets to get information on a single movie. In the example below, the first double bracket says that we're looking at the first object in the outer list, and the second double bracket says that we want to look at the information in the fifth object (in this case, the fifth movie).

In [None]:
# Get the information from the fifth movie in the list
movie_ratings[[1]][[5]]

From there, we can use the dollar sign notation to get the value for each key in our inner list. So this next bit of code will get us the information on the Gross for the fifth movie from our list:

In [None]:
# Get just the Gross from the fifth movie
movie_ratings[[1]][[5]]$Gross

At this point, you may be wondering *why can't we just get the Gross value for all the movies in our list and then convert it to a column in a dataframe*? The reason is that, in this case, not every movie has a Gross value recorded for it. (Remember that, for hierarchical data, unlike tabular data, there's no requirement that every value is recorded for every observation, so it's possible that you might have different variables for different observations.) We can see this by looping through the first three movies in our list and trying to get the value for "Gross" for each of them.

In [None]:
# print the value for the Gross key for the first 3 movies
for(i in 1:3){
    print(movie_ratings[[1]][[i]]$Gross)
}

We can see from the output that there were recorded values for "Gross" for the first and second movie (even though the value was "unknown" for the second movie). The NULL result, however, tells us that there was no "Gross" information for the third movie at all! If you look through some of the other movies, you can see that there is a lot of information that isn't recorded for every movie, and it's not entirely clear why this is the case. (We'll talk about some strategies for handling missing information tomorrow!)
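
Instead of eyeballing the output of a loop, you can also ask R which entries are missing a key. The sketch below uses a toy list called `movies` as a stand-in for `movie_ratings[[1]]`. The titles and values are invented, so treat it as an illustration of the idea rather than a summary of the real data.

```r
# Toy stand-in for the nested list: three movies, only some with a Gross key
movies <- list(
  list(Title = "A", Gross = "$1.2M"),
  list(Title = "B", Gross = "unknown"),
  list(Title = "C")
)

# TRUE where a movie has no Gross entry at all
missing_gross <- sapply(movies, function(m) is.null(m[["Gross"]]))
missing_gross

# Pull the values out, using NA as a placeholder where the key is absent
gross <- sapply(movies, function(m) {
  if (is.null(m[["Gross"]])) NA_character_ else m[["Gross"]]
})
gross
```

Note that the recorded value "unknown" and a key that is absent altogether are two different kinds of missing information, and this check only catches the second.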

If we did want to convert this to a tabular data structure, we'd need to have a column for every single different value recorded for each observation. This might lead to having a very large data frame that's mostly empty! We'd also need to figure out how to handle the observations where no value was recorded. It's much simpler to represent this dataset with a hierarchical data structure of nested lists instead.

## Your turn!

Read in the file `2018-2-8.json` from the "Ratings vs Gross" dataset. Check out its structure and see if the same information is recorded for each observation.

For an extra challenge, you can try reading in the .json file using the `read_json` function with the argument `simplifyVector = TRUE`. This will attempt to change your list into a dataframe. What is the resulting data structure? Does it look the way that you expected?

In [1]:
# your code goes here :)
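import json

# Read the exercise file; json.load returns nested Python dicts and lists
with open("../input/rating-vs-gross-collector/2018-2-8.json", "r") as load_f:
    load_dict = json.load(load_f)

# check the type of the top-level object, then print the whole structure
print(type(load_dict))
print(load_dict)
#print(load_dict[[1]])
#print(load_dict[['12 Strong']])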

# Txt files
___

The first thing I'd recommend doing with a .txt file is to open it to make sure it's actually raw text. If the file's on Kaggle, you can just look at the Data tab for that dataset ([like this one](https://www.kaggle.com/rtatman/character-encoding-examples/data)) to see what it looks like. If you're working locally, you can try opening it in a text editor ([like Gedit](https://github.com/GNOME/gedit)) or printing a few lines of the file to your console using `head`. Looking at part of your file will tell you if you're really working with raw text, or something else that's just been saved with a .txt file extension.

Once you've verified that, yes, you actually do have raw text, the next step is to read it into R. I personally prefer to read text files in line by line, partly because this is usually what I do in Python and I like to keep my workflows in the two languages as parallel as possible. To read in files by lines, I use the `read_lines` function from the readr package, which is loaded as part of the tidyverse. When I'm reading in a file for the first time, I generally just read in the first few lines to make sure everything works. (This makes sure I don't have to wait for a huge file to load into memory only to find out that the first line is corrupted!)

For this example, I'm reading in a text file of stop words (short, common words that are often removed during text preprocessing) from Afrikaans.

In [None]:
# Read in the first 10 lines from the file af.txt
af_stopwords <- read_lines("../input/stopword-lists-for-african-languages/af.txt", n_max = 10)

# check out the structure of the data we've read in
str(af_stopwords)

As you can see, our data has been read in as a character vector, with each line as a separate item in that vector. We can double check that our data has read in correctly by printing it out:

In [None]:
# print out all our words
af_stopwords

And that's it for reading in raw text data! As you can see, it's less fiddly than .json data because you don't need to worry about maintaining the structure of the data you're reading in.
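
One thing that's worth knowing: reading line by line gives you a character vector with one element per line, but readr also has a `read_file` function that returns the whole file as a single string. Here is a small self-contained sketch of the difference. It writes a made-up three-word file to a temporary location so that it doesn't depend on any dataset.

```r
library(readr)

# Write a tiny made-up text file so the example is self-contained
tmp <- tempfile(fileext = ".txt")
write_lines(c("die", "en", "van"), tmp)

# read_lines: a character vector with one element per line
as_lines <- read_lines(tmp)
length(as_lines)

# read_file: the whole file as a single string, newlines included
as_string <- read_file(tmp)
length(as_string)

# clean up the temporary file
unlink(tmp)
```

Which one you want depends on what you're doing next: a vector of lines is handy for word lists like this one, while a single string can be easier to work with for running text.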

## Your turn!

Read in all the words from the isiZulu stoplist (zu.txt) and make sure it looks good to you. 

For an extra challenge, you can try alphabetizing the list of stop words. (They're currently sorted by how frequently they occur in the language.)

In [2]:
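# your code goes here :)
# Open the isiZulu stop word list and print its full contents;
# the with block closes the file automatically
with open("../input/stopword-lists-for-african-languages/zu.txt") as f:
    print(f.read())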

# Xlsx files
___


Finally, we come to .xlsx or .xls files. These are proprietary file formats generated by Microsoft Excel, the popular spreadsheet editing software. Of the file formats we've talked about so far, these can be the easiest to read into R. Why? Because they store data in a tabular way, and R is really excellent at handling tabular data.

> **When will you run into trouble with .xls files?** In general, the prettier the original spreadsheet looks, the more information you'll lose when you read it into R. In particular, files with multiple spreadsheets in different tabs, lots of color coding and empty cells separating different subtables are more difficult to read into R. 
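
For workbooks with several tabs, it can help to list the sheets first and then read them one at a time. The sketch below uses `datasets.xlsx`, a sample workbook that ships with the `readxl` package, rather than any of the Kaggle datasets, so the sheet name "mtcars" is specific to that sample file.

```r
library(readxl)

# readxl_example() returns the path to a sample workbook bundled with readxl
path <- readxl_example("datasets.xlsx")

# List the tabs before reading anything
sheets <- excel_sheets(path)
sheets

# Read one tab by name; by default read_excel() reads only the first sheet
mtcars_sheet <- read_excel(path, sheet = "mtcars")
head(mtcars_sheet)
```

Color coding and other formatting won't survive the trip into R, so any information stored that way needs to be recorded somewhere else, such as in its own column.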

To read in .xlsx files, I like the `read_excel` function from the `readxl` package. It works just like `read.csv` or `read_csv`. To test it out, let's read in an .xls file from a dataset of air quality measures:

In [None]:
# read in our .xls file
air_quality_data <- read_excel("../input/air-quality-data-earlwood-nsw-australia/Earlwood_Air_Data_17_18.xls")

# check out its structure
str(air_quality_data)

Looks good so far! Let's just double-check that our data looks reasonable by checking out the first few rows of the dataframe. 

In [None]:
head(air_quality_data)

Yep, that still looks good to me! Now it's your turn to try reading in an Excel file.

## Your turn!

`read_excel` works for both .xlsx & .xls files. Read in the `njs2016_dd_en.xlsx` file from the Canada National Justice Survey 2016 dataset using read_excel. Check its structure and some of the data to make sure it looks the way you expect.

In [27]:
# your code goes here
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import openpyxl

# open the Excel file
wb = openpyxl.load_workbook('../input/national-justice-survey-2016/njs2016_dd_en.xlsx')
print(wb.sheetnames)  # names of all worksheets in the workbook

sheet = wb['Sheet1']  # get a worksheet by name
print(sheet.title)

sheet02 = wb.active  # get the active worksheet
print(sheet02.title)

# get the first worksheet by looking up its name
sheets = wb.sheetnames
booksheet = wb[sheets[0]]

# iterate over all the rows and print each row's cell values
for row in booksheet.rows:
    line = [col.value for col in row]
    print(line)

# read a single value by its coordinates
cell_11 = booksheet['A1'].value
cell_11 = booksheet.cell(row=1, column=1).value
'''

import numpy as np
import pandas as pd
import os
# helpful character encoding module
import chardet

print(os.listdir("../input"))
print(os.listdir("../input/national-justice-survey-2016"))
# look at the first 720 bytes to guess the character encoding
with open("../input/national-justice-survey-2016/njs2016_dd_en.xlsx", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(720))

# check what the character encoding might be
print(result)
#train_df = pd.read_csv("../input/national-justice-survey-2016/njs2016_dd_en.xlsx",encoding='ISO-8859-1')
#train_df = pd.read_csv("../input/national-justice-survey-2016/njs2016_dd_en.xlsx",encoding='Windows-1253')
#train_df = pd.read_csv("../input/national-justice-survey-2016/njs2016_dd_en.xlsx", encoding="ISO-8859-1")
'''

# And that's it for Day 1!

___

And that's it for today! If you have any questions, be sure to post them in the comments below or [on the forums](https://www.kaggle.com/questions-and-answers).

Remember that your notebook is private by default, and in order to share it with other people or ask for help with it, you'll need to make it public. First, you'll need to save a version of your notebook that shows your current work by hitting the "Commit & Run" button. (Your work is saved automatically, but versioning your work lets you go back and look at what it was like at the point you saved it. It also lets you share a nice compiled notebook instead of just the raw code.) Then, once your notebook is finished running, you can go to the Settings tab in the panel to the left (you may have to expand it by hitting the [<] button next to the "Commit & Run" button) and set the "Visibility" dropdown to "Public".