# Looking at Data
**Sanket Dave**

---
 Whenever you're working with a new dataset, the first thing you should do is
 look at it! 
 - What is the format of the data? 
 - What are the dimensions? 
 - What are the variable names? 
 - How are the variables stored? 
 - Are there missing data? 
 - Are there any flaws in the data?


 This lesson will teach you how to answer these questions and more using R's
 built-in functions. We'll be using a dataset constructed from the United
 States Department of Agriculture's PLANTS Database
 (http://plants.usda.gov/adv_search.html).
 
 ---

In [2]:
getwd()

In [3]:
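# stringsAsFactors = TRUE reproduces the factor columns shown in the output below;
# since R 4.0.0, read.csv() otherwise reads text columns as character vectors.
plants <- read.csv('../Data/plants.csv', stringsAsFactors = TRUE)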

Now that we've imported the data, let's begin by checking the class of the plants variable with class(plants).
 This will give us a clue as to the overall structure of the data.

In [5]:
class(plants)

In [6]:
# returns the dimensions: number of rows, then number of columns

dim(plants)

In [7]:
# returns the number of rows

nrow(plants)

In [8]:
# returns the number of columns

ncol(plants)
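
As a quick sanity check, dim() reports the same two numbers that nrow() and ncol() return separately. The sketch below uses a small made-up data frame (`toy` is hypothetical and not part of the plants data) so that it runs without the CSV file.

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(
  name  = c("a", "b", "c"),
  value = c(1.5, NA, 3.0)
)

dims <- dim(toy)   # integer vector: rows first, then columns
dims

# dim() agrees with nrow() and ncol() taken together
identical(dims, c(nrow(toy), ncol(toy)))
```

The same comparison should hold for plants once it has been loaded.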

In [9]:
# returns the amount of memory the object occupies, in bytes

object.size(plants)

706848 bytes
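
A raw byte count can be hard to read at a glance. As a minimal sketch, the value returned by object.size() can be formatted in larger units with format(). The data frame `toy` below is a hypothetical stand-in so the example runs on its own.

```r
# Hypothetical stand-in for a real dataset
toy <- data.frame(id = 1:1000, score = seq(0, 1, length.out = 1000))

size <- object.size(toy)       # object of class "object_size", measured in bytes
format(size, units = "Kb")     # the same size expressed in kilobytes
format(size, units = "auto")   # let R pick a readable unit
```

Calling format(object.size(plants), units = "Kb") would do the same for the plants data.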

 Now that we have a sense of the shape and size of the dataset, let's get a
 feel for what's inside. names(plants) will return a character vector of column
 (i.e. variable) names. Give it a shot.

In [10]:
names(plants)

We've applied fairly descriptive variable names to this dataset, but that
 won't always be the case. A logical next step is to peek at the actual data.
 However, our dataset contains over 5000 observations (rows), so it's
 impractical to view the whole thing all at once.

The head() function allows you to preview the top of the dataset. Give it a
 try with only one argument.


In [11]:
head(plants)

X,Scientific_Name,Duration,Active_Growth_Period,Foliage_Color,pH_Min,pH_Max,Precip_Min,Precip_Max,Shade_Tolerance,Temp_Min_F
1,Abelmoschus,,,,,,,,,
2,Abelmoschus esculentus,"Annual, Perennial",,,,,,,,
3,Abies,,,,,,,,,
4,Abies balsamea,Perennial,Spring and Summer,Green,4.0,6.0,13.0,60.0,Tolerant,-43.0
5,Abies balsamea var. balsamea,Perennial,,,,,,,,
6,Abutilon,,,,,,,,,


 Take a minute to look through and understand the output above. Each row is
 labeled with the observation number and each column with the variable name.
 Your screen is probably not wide enough to view all 11 columns side-by-side,
 in which case R displays as many columns as it can on each line before
 continuing on the next.


 By default, head() shows you the first six rows of the data. You can alter
 this behavior by passing as a second argument the number of rows you'd like to
 view. Use head() to preview the first 10 rows of plants.

In [12]:
head(plants,10)

X,Scientific_Name,Duration,Active_Growth_Period,Foliage_Color,pH_Min,pH_Max,Precip_Min,Precip_Max,Shade_Tolerance,Temp_Min_F
1,Abelmoschus,,,,,,,,,
2,Abelmoschus esculentus,"Annual, Perennial",,,,,,,,
3,Abies,,,,,,,,,
4,Abies balsamea,Perennial,Spring and Summer,Green,4.0,6.0,13.0,60.0,Tolerant,-43.0
5,Abies balsamea var. balsamea,Perennial,,,,,,,,
6,Abutilon,,,,,,,,,
7,Abutilon theophrasti,Annual,,,,,,,,
8,Acacia,,,,,,,,,
9,Acacia constricta,Perennial,Spring and Summer,Green,7.0,8.5,4.0,20.0,Intolerant,-13.0
10,Acacia constricta var. constricta,Perennial,,,,,,,,


 The same applies for using tail() to preview the end of the dataset. Use
 tail() to view the last 15 rows.

In [15]:
tail(plants,15)

X,Scientific_Name,Duration,Active_Growth_Period,Foliage_Color,pH_Min,pH_Max,Precip_Min,Precip_Max,Shade_Tolerance,Temp_Min_F
5152,Zizania,,,,,,,,,
5153,Zizania aquatica,Annual,Spring,Green,6.4,7.4,30.0,50.0,Intolerant,32.0
5154,Zizania aquatica var. aquatica,Annual,,,,,,,,
5155,Zizania palustris,Annual,,,,,,,,
5156,Zizania palustris var. palustris,Annual,,,,,,,,
5157,Zizaniopsis,,,,,,,,,
5158,Zizaniopsis miliacea,Perennial,Spring and Summer,Green,4.3,9.0,35.0,70.0,Intolerant,12.0
5159,Zizia,,,,,,,,,
5160,Zizia aptera,Perennial,,,,,,,,
5161,Zizia aurea,Perennial,,,,,,,,
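
The second argument of head() and tail() can also be negative, in which case it says how many rows to leave out rather than how many to show. The sketch below illustrates this on a hypothetical eight-row data frame (`toy` is made up for this example).

```r
# Hypothetical toy data frame with 8 rows
toy <- data.frame(id = 1:8, letter = letters[1:8])

head(toy, 3)    # first 3 rows
tail(toy, 2)    # last 2 rows; the original row labels 7 and 8 are kept
head(toy, -2)   # all rows except the last 2
tail(toy, -5)   # all rows except the first 5
```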


 After previewing the top and bottom of the data, you probably noticed lots of
 NAs, which are R's placeholders for missing values (they appear as empty
 fields in the previews above). Use summary(plants) to get a better feel for
 how each variable is distributed and how much of the dataset is missing.


In [16]:
summary(plants)

       X                            Scientific_Name              Duration   
 Min.   :   1   Abelmoschus                 :   1   Perennial        :3031  
 1st Qu.:1292   Abelmoschus esculentus      :   1   Annual           : 682  
 Median :2584   Abies                       :   1   Annual, Perennial: 179  
 Mean   :2584   Abies balsamea              :   1   Annual, Biennial :  95  
 3rd Qu.:3875   Abies balsamea var. balsamea:   1   Biennial         :  57  
 Max.   :5166   Abutilon                    :   1   (Other)          :  92  
                (Other)                     :5160   NA's             :1030  
           Active_Growth_Period      Foliage_Color      pH_Min     
 Spring and Summer   : 447      Dark Green  :  82   Min.   :3.000  
 Spring              : 144      Gray-Green  :  25   1st Qu.:4.500  
 Spring, Summer, Fall:  95      Green       : 692   Median :5.000  
 Summer              :  92      Red         :   4   Mean   :4.997  
 Summer and Fall     :  24      White-Gray  

 summary() provides different output for each variable, depending on its class.
 For numeric data such as Precip_Min, summary() displays the minimum, 1st
 quartile, median, mean, 3rd quartile, and maximum. These values help us
 understand how the data are distributed.

 For categorical variables (called 'factor' variables in R), summary() displays
 the number of times each value (or 'level') occurs in the data. For example,
 each value of Scientific_Name only appears once, since it is unique to a
 specific plant. In contrast, the summary for Duration (also a factor variable)
 tells us that our dataset contains 3031 Perennial plants, 682 Annual plants,
 etc.
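
If the main question is how much of each variable is missing, one common approach is to count NAs per column directly with is.na() and colSums(). This is a sketch on a hypothetical data frame (`toy` and its column names are made up), not a step from the original lesson.

```r
# Hypothetical toy data frame with some missing values
toy <- data.frame(
  species = c("x", "y", "z", "w"),
  ph_min  = c(4.0, NA, 7.0, NA),
  precip  = c(13L, NA, 4L, 30L)
)

# is.na() on a data frame gives a logical matrix; summing each column counts the NAs
na_counts <- colSums(is.na(toy))
na_counts

# colMeans() on the same matrix gives the proportion missing per column
na_share <- colMeans(is.na(toy))
na_share
```

The same two calls applied to plants would give per-variable counts and proportions of missing values.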

 You can see that R truncated the summary for Active_Growth_Period by including
 a catch-all category called '(Other)'. Since it is a categorical/factor
 variable, we can see how many times each value actually occurs in the data
 with table(plants$Active_Growth_Period).

In [18]:
table(plants$Active_Growth_Period)


Fall, Winter and Spring                  Spring         Spring and Fall 
                     15                     144                      10 
      Spring and Summer    Spring, Summer, Fall                  Summer 
                    447                      95                      92 
        Summer and Fall              Year Round 
                     24                       5 
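
Note that table() leaves out missing values by default, so the counts above cover only the rows where a growth period was recorded. The sketch below, on a hypothetical factor named `growth`, shows how the useNA argument adds a count for NA and how sort() orders the result.

```r
# Hypothetical factor with one missing value
growth <- factor(c("Spring", "Summer", "Spring", NA, "Year Round", "Spring"))

table(growth)                              # NA is dropped by default

counts <- table(growth, useNA = "ifany")   # include a count for NA when present
sort(counts, decreasing = TRUE)            # most frequent value first
```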

 Each of the functions we've introduced so far has its place in helping you to
 better understand the structure of your data. However, we've left the best for
 last.

 Perhaps the most useful and concise function for understanding the *str*ucture
 of your data is str(). Give it a try now.


In [19]:
str(plants)

'data.frame':	5166 obs. of  11 variables:
 $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Scientific_Name     : Factor w/ 5166 levels "Abelmoschus",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ Duration            : Factor w/ 8 levels "Annual","Annual, Biennial",..: NA 4 NA 7 7 NA 1 NA 7 7 ...
 $ Active_Growth_Period: Factor w/ 8 levels "Fall, Winter and Spring",..: NA NA NA 4 NA NA NA NA 4 NA ...
 $ Foliage_Color       : Factor w/ 6 levels "Dark Green","Gray-Green",..: NA NA NA 3 NA NA NA NA 3 NA ...
 $ pH_Min              : num  NA NA NA 4 NA NA NA NA 7 NA ...
 $ pH_Max              : num  NA NA NA 6 NA NA NA NA 8.5 NA ...
 $ Precip_Min          : int  NA NA NA 13 NA NA NA NA 4 NA ...
 $ Precip_Max          : int  NA NA NA 60 NA NA NA NA 20 NA ...
 $ Shade_Tolerance     : Factor w/ 3 levels "Intermediate",..: NA NA NA 3 NA NA NA NA 2 NA ...
 $ Temp_Min_F          : int  NA NA NA -43 NA NA NA NA -13 NA ...


 The beauty of str() is that it combines many of the features of the other
 functions you've already seen, all in a concise and readable format. At the
 very top, it tells us that the class of plants is 'data.frame' and that it has
 5166 observations and 11 variables. It then gives us the name and class of
 each variable, as well as a preview of its contents.

 str() is actually a very general function that you can use on most objects in
 R. Any time you want to understand the structure of something (a dataset,
 function, etc.), str() is a good place to start.
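
To illustrate that generality, here is a small sketch that calls str() on a nested list, a base function, and a plain vector. The list `settings` is hypothetical and exists only for this example.

```r
# Hypothetical nested list
settings <- list(
  name   = "demo",
  sizes  = c(1.5, 2.5),
  nested = list(flag = TRUE)
)

str(settings)   # shows each element's type, with nested elements indented
str(nrow)       # for a function, shows its argument list
str(1:5)        # for a vector, shows type, length, and the first values
```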

 In this lesson, you learned how to get a feel for the structure and contents
 of a new dataset using a collection of simple and useful functions. Taking the
 time to do this upfront can save you time and frustration later on in your
 analysis.

Source: swirlstats