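Python has become one of the most popular languages for data analysis and machine learning in recent years, and it is used especially widely in machine learning and deep learning. Mainstream deep learning platforms such as Google's TensorFlow, Facebook's PyTorch, and the University of Montreal's Theano all provide Python interfaces, and many universities have adopted Python as their introductory programming language. Python has several notable characteristics:
- Free. A free development platform is an economically viable first choice for cash-strapped small and medium-sized enterprises and for teaching and research groups. Matlab was once hugely popular, but its price was more than a typical research group could afford, so many researchers moved to Python or R.
- Comprehensive packages. As the saying goes, "Life is short, I use Python." Python offers pandas for data analysis, matplotlib and seaborn for plotting, TensorFlow for deep learning, PySpark for big-data processing, Pygame for game development, PyQt for GUI development, and Scrapy and Beautiful Soup for web scraping, among others. Almost every commonly needed tool can be found as a package.
- Strong community. Many researchers and developers maintain Python, and there are plenty of forums for discussion.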


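Python offers high development productivity, but for engineering projects that are computationally heavy or demand a high degree of parallelism, its performance may fall short of languages such as C++ and Java, although many practitioners have long been working to close this gap. For production online services, large companies still tend to write the underlying algorithms in C++ or Java for efficiency.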

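You have all studied Python before and have come across structures such as the list and the dictionary (dict). Today we introduce a few small techniques that make writing Python programs more efficient.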

## Iterable Vs Iterator

### Iterable:
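- An iterable is an object whose elements can be read one at a time with a loop, such as a list, a dictionary, or an open file object.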




In [29]:
# Iterating over a list
numbers = [1, 2, 3, 4, 5]
for num in numbers:
    print(num)

1
2
3
4
5


In [30]:
# Iterating over a dictionary
country_dict = {'CN': 'China', 'US': 'United States of America', 'UK': 'United Kingdom of Great Britain'}
for key, value in country_dict.items():
    print(key+':'+value)

CN:China
US:United States of America
UK:United Kingdom of Great Britain


In [48]:
# Iterate over each line of the file
with open('shakespeare.txt', 'r') as file:
    for line in file:
        print(line)
        #print(line.decode('utf-8', errors='ignore').encode('utf-8').strip())

William Shakespeare was an English poet, playwright, and actor, widely regarded as the greatest writer in the English language and the world's pre-eminent dramatist.[2] He is often called England's national poet, and the "Bard of Avon".[3][nb 2] His extant works, including collaborations, consist of approximately 38 plays,[nb 3] 154 sonnets, two long narrative poems, and a few other verses, some of uncertain authorship. His plays have been translated into every major living language and are performed more often than those of any other playwright.[4]



Shakespeare was born and brought up in Stratford-upon-Avon, Warwickshire. At the age of 18, he married Anne Hathaway, with whom he had three children: Susanna, and twins Hamnet and Judith. Sometime between 1585 and 1592, he began a successful career in London as an actor, writer, and part-owner of a playing company called the Lord Chamberlain's Men, later known as the King's Men. He appears to have retired to Stratford around 1613, at ag


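### Iterator:
- An iterator produces its next element each time the next function is called on it. Once all elements have been consumed the iterator is exhausted, and a further call to next raises StopIteration.

The iter function turns a list into an iterator.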

In [49]:
num_iterator = iter(numbers)

In [50]:
num_iterator

<list_iterator at 0x10ab277f0>

In [51]:
next(num_iterator)

1

In [52]:
next(num_iterator)

2

In [53]:
next(num_iterator)

3

In [54]:
num_iterator = iter(numbers)
print(*num_iterator)

1 2 3 4 5
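
To make the exhaustion behaviour concrete, here is a minimal sketch. The variable names (`it`, `first_pass`, `fallback`, `second_pass`) are chosen for illustration only. It shows that a consumed iterator raises StopIteration, that next accepts a default value, and that the underlying list can always produce a fresh iterator.

```python
numbers = [1, 2, 3]
it = iter(numbers)

# Consume all three elements.
first_pass = [next(it) for _ in range(3)]

# A further next() raises StopIteration because the iterator is exhausted.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

# Passing a default value makes next() return it instead of raising.
fallback = next(it, None)

# The list itself is untouched and can hand out a fresh iterator.
second_pass = list(iter(numbers))
```

This is also why the cell above calls iter(numbers) again before printing: the earlier iterator had already handed out some of its elements.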


### Enumerate

In [55]:
#Enumerate
countries = ['China', 'USA', 'UK', 'Japan', 'Russia']
e = enumerate(countries) 

In [56]:
e

<enumerate at 0x10aba0240>

In [57]:
list(e)

[(0, 'China'), (1, 'USA'), (2, 'UK'), (3, 'Japan'), (4, 'Russia')]

In [58]:
#Enumerate
e = enumerate(countries) 
for index, num in e:
    print(index, num)

0 China
1 USA
2 UK
3 Japan
4 Russia
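
By default enumerate counts from 0. As a small sketch, the optional start argument lets the counter begin elsewhere, which is handy for human-readable rankings. The name `ranked` is only illustrative.

```python
countries = ['China', 'USA', 'UK', 'Japan', 'Russia']

# start=1 makes the counter begin at 1 instead of 0.
ranked = [f"{rank}. {name}" for rank, name in enumerate(countries, start=1)]

for line in ranked:
    print(line)
```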


### Zip

In [59]:
countries = ['China', 'USA', 'UK', 'Japan', 'Russia']
pops = [13, 3, 0.8, 1.2, 1.2]

Suppose we want to build a dictionary with the country name as the key and the population as the value. We can use a loop.

In [60]:
country_pop_dict = {}
for index in range(len(countries)):
    country_pop_dict[countries[index]] = pops[index]

In [61]:
country_pop_dict

{'China': 13, 'Japan': 1.2, 'Russia': 1.2, 'UK': 0.8, 'USA': 3}

We can also use the powerful zip function.

In [62]:
zipped = zip(countries, pops)
zipped

<zip at 0x10ab8bb48>

In [63]:
for k, v in zipped:
    print(k, v)

China 13
USA 3
UK 0.8
Japan 1.2
Russia 1.2


In [64]:
zipped = zip(countries, pops)
list(zipped)

[('China', 13), ('USA', 3), ('UK', 0.8), ('Japan', 1.2), ('Russia', 1.2)]

In [65]:
zipped = zip(countries, pops)
dict(zipped)

{'China': 13, 'Japan': 1.2, 'Russia': 1.2, 'UK': 0.8, 'USA': 3}
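
Two properties of zip are worth a quick illustration, sketched here with deliberately shortened sample lists. First, zip stops at the shortest input. Second, a zip object is itself an iterator, so it can be consumed only once, which is why the cells above rebuild `zipped` before each use.

```python
countries = ['China', 'USA', 'UK']
pops = [13, 3]            # deliberately one element short

zipped = zip(countries, pops)
pairs = list(zipped)      # 'UK' is dropped because pops ran out first
leftover = list(zipped)   # empty: the zip iterator is already exhausted

# "Unzipping": zip(*pairs) transposes the pairs back into two tuples.
names, values = zip(*pairs)
```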

## List/Dict Comprehensions
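In Python, as in C++ and Java, we can loop with for and while. Python additionally provides comprehensions, which make the code more concise. For example, suppose we need to convert every company name in the list below to lowercase. A common approach is a loop.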

In [66]:
corps = ['Huawei', 'Lenovo', 'Dell', 'toyoto', 'Boeing']
corps_lower = []
for c in corps:
    corps_lower.append(c.lower())

In [67]:
corps_lower

['huawei', 'lenovo', 'dell', 'toyoto', 'boeing']

Another approach is a comprehension.

In [68]:
corps_lower = [c.lower() for c in corps]
corps_lower

['huawei', 'lenovo', 'dell', 'toyoto', 'boeing']

As another example, add 1 to each element.

In [69]:
numbers = [1, 3, 5, 7, 9]
numbers_plus_one = [num+1 for num in numbers]
numbers_plus_one

[2, 4, 6, 8, 10]

A comprehension can also filter by a condition, for example picking out the names longer than 8 characters.

In [70]:
names= ['Richard White', 'Tom Fox', 'Mary Baltimore', 'Sydney Bush', 'William Smiths']
names_longer = [name for name in names if len(name)>8]
names_longer

['Richard White', 'Mary Baltimore', 'Sydney Bush', 'William Smiths']

A comprehension can also build a dictionary, for example one that maps each word to its length.


In [71]:
words = ['hello', 'good', 'beautiful', 'dangerous', 'yell']
word_len_dict = {word:len(word) for word in words}
word_len_dict

{'beautiful': 9, 'dangerous': 9, 'good': 4, 'hello': 5, 'yell': 4}
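
The filtering condition shown earlier for lists works in dictionary comprehensions as well, and the same syntax with a single expression inside braces yields a set. The following is a short sketch, with `long_word_len` and `lengths` as illustrative names.

```python
words = ['hello', 'good', 'beautiful', 'dangerous', 'yell']

# Dict comprehension with a filter: keep only words longer than 4 characters.
long_word_len = {word: len(word) for word in words if len(word) > 4}

# Set comprehension: the distinct word lengths.
lengths = {len(word) for word in words}
```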

## Generator

The list comprehension was introduced above:

In [72]:
temp = [num*2 for num in range(10)]
temp

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

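If we replace the square brackets with parentheses, the result is no longer a list but a generator. Generators are used often in machine learning, especially for producing training data, where one batch is generated per training step. When the number of items is very large, building the whole list in memory at once can exhaust memory and freeze the system, whereas a generator produces values one at a time on demand and avoids this.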

In [73]:
temp = (num*2 for num in range(10))
temp

<generator object <genexpr> at 0x10ab93ca8>

In [74]:
# A generator can also be traversed with a loop
for num in temp:
    print(num)

0
2
4
6
8
10
12
14
16
18
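
The memory argument can be checked with a rough sketch using sys.getsizeof. The element count `n` is an arbitrary choice, and the exact byte counts depend on the Python version and platform, so only the comparison matters here. Note that getsizeof reports the size of the list object itself (its array of references), not of the integers it points to.

```python
import sys

n = 1_000_000
as_list = [num * 2 for num in range(n)]   # all values are stored at once
as_gen = (num * 2 for num in range(n))    # values are produced on demand

list_size = sys.getsizeof(as_list)   # grows with n
gen_size = sys.getsizeof(as_gen)     # small, independent of n

# Consuming the generator still visits every value, one at a time.
total = sum(as_gen)
```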


In [75]:
# A generator can produce values through the next function
temp = (num*2 for num in range(10))
print(next(temp))
print(next(temp))


0
2


In [76]:
# A generator can also be defined with a function that uses yield instead of return
def num_sequence(n):
    """Generate values from 0 to n - 1."""
    i = 0
    while i < n:
        yield i
        i += 1

In [77]:
temp = num_sequence(6)
next(temp)

0

In [78]:
next(temp)

1

In [79]:
next(temp)

2
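
As a sketch of the training-data use case mentioned at the start of this section, the hypothetical helper `batch_generator` below yields one slice of the data per step. The function name, its parameters, and the toy data are assumptions made for illustration and do not come from any particular library.

```python
def batch_generator(data, batch_size):
    """Yield consecutive slices of data, each with at most batch_size items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

samples = list(range(10))   # toy stand-in for a training set

# Collected into a list here only so the result can be inspected.
batches = list(batch_generator(samples, 4))
```

In a real training loop one would iterate over batch_generator(samples, 4) directly, so that only one batch is held at a time.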

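## Lambda

Python's famous lambda is simply another way to define a function. We can define a function with def or with lambda. Using def is recommended for anything beyond a trivial function, because it leaves room for proper documentation and exception handling.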

In [135]:
# Addition
def add(a, b):
    if isinstance(a, (int, float)) and isinstance(b, (int,  float)):
        return a+b
    else:
        raise ValueError('Input Type Not Correct!')

In [136]:
add(3,4)

7

In [137]:
# When the function is simple enough
add = lambda a, b: a + b
add(3,4)

7
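
In practice a lambda is most often passed inline as a short key function rather than bound to a name. A minimal sketch, reusing the country data from the Zip section (`by_pop` and `longest` are illustrative names):

```python
countries = ['China', 'USA', 'UK', 'Japan', 'Russia']
pops = [13, 3, 0.8, 1.2, 1.2]
pairs = list(zip(countries, pops))

# Sort the (country, population) pairs by population, largest first.
by_pop = sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Any callable works as a key; the built-in len needs no lambda at all.
longest = max(countries, key=len)
```

Because Python's sort is stable, Japan and Russia, which share the same value, keep their original relative order.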

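## Application Example

Next, we apply the techniques learned above to tokenize a text, count the occurrences of each word, and output the sorted result.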

In [83]:
# Read all lines of the file into a list
with open('shakespeare.txt', 'r') as file:
    texts = file.readlines()

In [84]:
texts

['William Shakespeare was an English poet, playwright, and actor, widely regarded as the greatest writer in the English language and the world\'s pre-eminent dramatist.[2] He is often called England\'s national poet, and the "Bard of Avon".[3][nb 2] His extant works, including collaborations, consist of approximately 38 plays,[nb 3] 154 sonnets, two long narrative poems, and a few other verses, some of uncertain authorship. His plays have been translated into every major living language and are performed more often than those of any other playwright.[4]\n',
 '\n',
 "Shakespeare was born and brought up in Stratford-upon-Avon, Warwickshire. At the age of 18, he married Anne Hathaway, with whom he had three children: Susanna, and twins Hamnet and Judith. Sometime between 1585 and 1592, he began a successful career in London as an actor, writer, and part-owner of a playing company called the Lord Chamberlain's Men, later known as the King's Men. He appears to have retired to Stratford arou

In [85]:
type(texts)

list

In [86]:
texts[0]

'William Shakespeare was an English poet, playwright, and actor, widely regarded as the greatest writer in the English language and the world\'s pre-eminent dramatist.[2] He is often called England\'s national poet, and the "Bard of Avon".[3][nb 2] His extant works, including collaborations, consist of approximately 38 plays,[nb 3] 154 sonnets, two long narrative poems, and a few other verses, some of uncertain authorship. His plays have been translated into every major living language and are performed more often than those of any other playwright.[4]\n'

In [87]:
type(texts[0])

str

In [88]:
# Remove punctuation
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [89]:
# Map punctuation marks and digits to spaces
trantab_punc = str.maketrans(string.punctuation,' '*len(string.punctuation))
trantab_digit = str.maketrans('0123456789',' '*10)

In [90]:
texts[0].translate(trantab_punc).translate(trantab_digit).strip('\n')

'William Shakespeare was an English poet  playwright  and actor  widely regarded as the greatest writer in the English language and the world s pre eminent dramatist     He is often called England s national poet  and the  Bard of Avon      nb    His extant works  including collaborations  consist of approximately    plays  nb        sonnets  two long narrative poems  and a few other verses  some of uncertain authorship  His plays have been translated into every major living language and are performed more often than those of any other playwright    '
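
The two translation tables can also be merged into one, so that a single translate call handles both punctuation and digits. This is only a sketch of an alternative, not the approach used in the rest of this notebook. The sample sentence and the names `chars`, `table`, and `cleaned` are made up for illustration.

```python
import string

# One table that maps every punctuation mark and digit to a space.
chars = string.punctuation + string.digits
table = str.maketrans(chars, ' ' * len(chars))

sample = "Born in 1564, he wrote approx. 38 plays!"   # made-up sample text

# Translate, lowercase, then split on whitespace to get the words.
cleaned = sample.translate(table).lower().split()
```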

In [91]:
# Strip punctuation and digits, then convert to lowercase
remove_punc_digit = lambda text: text.translate(trantab_punc).translate(trantab_digit).strip('\n')
texts = [remove_punc_digit(text).lower().strip() for text in texts]

In [92]:
texts

['william shakespeare was an english poet  playwright  and actor  widely regarded as the greatest writer in the english language and the world s pre eminent dramatist     he is often called england s national poet  and the  bard of avon      nb    his extant works  including collaborations  consist of approximately    plays  nb        sonnets  two long narrative poems  and a few other verses  some of uncertain authorship  his plays have been translated into every major living language and are performed more often than those of any other playwright',
 '',
 'shakespeare was born and brought up in stratford upon avon  warwickshire  at the age of     he married anne hathaway  with whom he had three children  susanna  and twins hamnet and judith  sometime between      and       he began a successful career in london as an actor  writer  and part owner of a playing company called the lord chamberlain s men  later known as the king s men  he appears to have retired to stratford around       a

In [93]:
# Split the text of each list element into words
texts  = [text.split() for text in texts]

In [134]:
texts[0][:10]

['william',
 'shakespeare',
 'was',
 'an',
 'english',
 'poet',
 'playwright',
 'and',
 'actor',
 'widely']

In [99]:
# Merge all words into a single list
temp = [word for text in texts for word in text]

In [100]:
len(temp)

399

In [101]:
temp[:10]

['william',
 'shakespeare',
 'was',
 'an',
 'english',
 'poet',
 'playwright',
 'and',
 'actor',
 'widely']

In [104]:
#Count the frequencies of each word
word_freq_dict = {}
for word in temp:
    word_freq_dict[word] = word_freq_dict.get(word, 0) + 1
    #if word not in word_freq_dict.keys():
        #word_freq_dict[word] = 1
    #else:
        #word_freq_dict[word] += 1
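
Another common way to write the same counting loop is collections.defaultdict, which creates a missing key with a default value on first access. A minimal sketch on a made-up word list:

```python
from collections import defaultdict

words = ['and', 'the', 'and', 'of', 'and', 'the']   # made-up sample

word_freq = defaultdict(int)   # missing keys start at int() == 0
for word in words:
    word_freq[word] += 1
```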

In [105]:
word_freq_dict

{'a': 6,
 'about': 2,
 'accuracy': 1,
 'actor': 2,
 'actors': 1,
 'adapted': 1,
 'age': 3,
 'all': 2,
 'also': 1,
 'an': 3,
 'and': 24,
 'anne': 1,
 'any': 1,
 'appearance': 1,
 'appears': 1,
 'approximately': 1,
 'are': 3,
 'around': 1,
 'as': 9,
 'at': 2,
 'attributed': 1,
 'authorship': 1,
 'avon': 2,
 'bard': 1,
 'been': 2,
 'began': 1,
 'beliefs': 1,
 'ben': 1,
 'best': 1,
 'between': 2,
 'born': 1,
 'brought': 1,
 'but': 2,
 'by': 3,
 'called': 2,
 'career': 1,
 'centuries': 1,
 'chamberlain': 1,
 'children': 1,
 'collaborated': 1,
 'collaborations': 1,
 'collected': 1,
 'comedies': 1,
 'company': 1,
 'condell': 1,
 'considerable': 1,
 'considered': 1,
 'consist': 1,
 'constantly': 1,
 'contexts': 1,
 'cultural': 1,
 'definitive': 1,
 'died': 1,
 'diverse': 1,
 'dramatic': 1,
 'dramatist': 1,
 'during': 1,
 'early': 1,
 'edition': 1,
 'editions': 1,
 'eminent': 1,
 'england': 1,
 'english': 3,
 'ever': 1,
 'every': 1,
 'extant': 1,
 'fellow': 1,
 'few': 2,
 'finest': 1,
 'first':

In [113]:
#Comprehension
freq_word_sets = [(v, k) for k, v in word_freq_dict.items()]
freq_word_sets[:10]

[(1, 'william'),
 (7, 'shakespeare'),
 (3, 'was'),
 (3, 'an'),
 (3, 'english'),
 (2, 'poet'),
 (2, 'playwright'),
 (24, 'and'),
 (2, 'actor'),
 (1, 'widely')]

In [118]:
freq_word_sets.sort(reverse=True)
freq_word_sets[:10]

[(24, 'and'),
 (16, 'of'),
 (15, 'the'),
 (12, 'in'),
 (11, 'his'),
 (9, 'as'),
 (8, 'he'),
 (7, 'shakespeare'),
 (6, 's'),
 (6, 'plays')]

In [121]:
# zip approach
freq_word_sets = zip(word_freq_dict.values(), word_freq_dict.keys())

In [123]:
freq_word_sets = list(freq_word_sets)
freq_word_sets.sort(reverse=True)
freq_word_sets[:12]

[(24, 'and'),
 (16, 'of'),
 (15, 'the'),
 (12, 'in'),
 (11, 'his'),
 (9, 'as'),
 (8, 'he'),
 (7, 'shakespeare'),
 (6, 's'),
 (6, 'plays'),
 (6, 'a'),
 (5, 'works')]

In [124]:
# lambda approach
help(sorted)

Help on built-in function sorted in module builtins:

sorted(iterable, key=None, reverse=False)
    Return a new list containing all items from the iterable in ascending order.
    
    A custom key function can be supplied to customize the sort order, and the
    reverse flag can be set to request the result in descending order.



In [127]:
sorted(word_freq_dict.items(), key=lambda item: item[1], reverse=True)

[('and', 24),
 ('of', 16),
 ('the', 15),
 ('in', 12),
 ('his', 11),
 ('as', 9),
 ('he', 8),
 ('shakespeare', 7),
 ('s', 6),
 ('plays', 6),
 ('a', 6),
 ('works', 5),
 ('known', 4),
 ('was', 3),
 ('an', 3),
 ('english', 3),
 ('language', 3),
 ('nb', 3),
 ('two', 3),
 ('other', 3),
 ('some', 3),
 ('have', 3),
 ('are', 3),
 ('age', 3),
 ('with', 3),
 ('to', 3),
 ('which', 3),
 ('were', 3),
 ('by', 3),
 ('poet', 2),
 ('playwright', 2),
 ('actor', 2),
 ('regarded', 2),
 ('writer', 2),
 ('world', 2),
 ('is', 2),
 ('often', 2),
 ('called', 2),
 ('avon', 2),
 ('including', 2),
 ('few', 2),
 ('been', 2),
 ('performed', 2),
 ('more', 2),
 ('stratford', 2),
 ('at', 2),
 ('three', 2),
 ('between', 2),
 ('men', 2),
 ('later', 2),
 ('king', 2),
 ('about', 2),
 ('produced', 2),
 ('work', 2),
 ('wrote', 2),
 ('published', 2),
 ('all', 2),
 ('but', 2),
 ('william', 1),
 ('widely', 1),
 ('greatest', 1),
 ('pre', 1),
 ('eminent', 1),
 ('dramatist', 1),
 ('england', 1),
 ('national', 1),
 ('bard', 1),
 ('e

In [130]:
# Use Counter from the collections module
from collections import Counter
word_freq = Counter(temp)

In [131]:
word_freq

Counter({'a': 6,
         'about': 2,
         'accuracy': 1,
         'actor': 2,
         'actors': 1,
         'adapted': 1,
         'age': 3,
         'all': 2,
         'also': 1,
         'an': 3,
         'and': 24,
         'anne': 1,
         'any': 1,
         'appearance': 1,
         'appears': 1,
         'approximately': 1,
         'are': 3,
         'around': 1,
         'as': 9,
         'at': 2,
         'attributed': 1,
         'authorship': 1,
         'avon': 2,
         'bard': 1,
         'been': 2,
         'began': 1,
         'beliefs': 1,
         'ben': 1,
         'best': 1,
         'between': 2,
         'born': 1,
         'brought': 1,
         'but': 2,
         'by': 3,
         'called': 2,
         'career': 1,
         'centuries': 1,
         'chamberlain': 1,
         'children': 1,
         'collaborated': 1,
         'collaborations': 1,
         'collected': 1,
         'comedies': 1,
         'company': 1,
         'condell': 1,
         'c

In [133]:
sorted(word_freq.items(), key=lambda item: item[1], reverse=True)

[('and', 24),
 ('of', 16),
 ('the', 15),
 ('in', 12),
 ('his', 11),
 ('as', 9),
 ('he', 8),
 ('shakespeare', 7),
 ('s', 6),
 ('plays', 6),
 ('a', 6),
 ('works', 5),
 ('known', 4),
 ('was', 3),
 ('an', 3),
 ('english', 3),
 ('language', 3),
 ('nb', 3),
 ('two', 3),
 ('other', 3),
 ('some', 3),
 ('have', 3),
 ('are', 3),
 ('age', 3),
 ('with', 3),
 ('to', 3),
 ('which', 3),
 ('were', 3),
 ('by', 3),
 ('poet', 2),
 ('playwright', 2),
 ('actor', 2),
 ('regarded', 2),
 ('writer', 2),
 ('world', 2),
 ('is', 2),
 ('often', 2),
 ('called', 2),
 ('avon', 2),
 ('including', 2),
 ('few', 2),
 ('been', 2),
 ('performed', 2),
 ('more', 2),
 ('stratford', 2),
 ('at', 2),
 ('three', 2),
 ('between', 2),
 ('men', 2),
 ('later', 2),
 ('king', 2),
 ('about', 2),
 ('produced', 2),
 ('work', 2),
 ('wrote', 2),
 ('published', 2),
 ('all', 2),
 ('but', 2),
 ('william', 1),
 ('widely', 1),
 ('greatest', 1),
 ('pre', 1),
 ('eminent', 1),
 ('dramatist', 1),
 ('england', 1),
 ('national', 1),
 ('bard', 1),
 ('e
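
Counter can also do the sorting itself. Its most_common method returns the items ordered from most to least frequent, optionally limited to the top n. A short sketch on a made-up word list rather than the full text:

```python
from collections import Counter

words = ['and', 'the', 'and', 'of', 'and', 'the']   # made-up sample

word_freq = Counter(words)

# The two most frequent words with their counts, highest count first.
top_two = word_freq.most_common(2)
```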